Abstract: In this paper, we consider the so-called uniquely
decodable one-to-one code (UDOOC) that is formed by inserting
a “comma” indicator, termed the unique word (UW), between
consecutive one-to-one codewords for separation. Along this
research direction, we first investigate several general combinatorial
properties of UDOOCs, in particular the enumeration of the 
number of UDOOC codewords for any (finite) codeword length. 
Based on the obtained formula on the number of length-n
codewords for a given UW, the per-letter average codeword length of
UDOOC for the optimal compression of a given source statistics 
can be computed. Several upper bounds on the average codeword 
length of such UDOOCs are next established. The analysis on the 
bounds of average codeword length then leads to two asymptotic
bounds for sources having infinitely many alphabets, one of which
is achievable and hence tight for a certain source statistics and
UW, and the other of which proves the achievability of source
entropy rate of UDOOCs when both the block size of source 
letters for UDOOC compression and UW length go to infinity. 
Efficient encoding and decoding algorithms for UDOOCs are also 
given in this paper. Numerical results show that when grouping
three English letters as a block, the UDOOCs with UW = 0001,
0000, 000001 and 000000 can respectively reach the compression
rates of 3.531, 4.089, 4.115, 4.709 bits per English letter (with
the lengths of UWs included), where the source stream to be
compressed is the book titled Alice’s Adventures in Wonderland. In
comparison with the first-order Huffman code, the second-order
Huffman code, the third-order Huffman code and the Lempel-Ziv
code, which respectively achieve the compression rates of 3.940,
3.585, 3.226 and 6.028 bits per single English letter, the proposed
UDOOCs can potentially result in comparable compression rate 
to the Huffman code under similar decoding complexity and yield 
a smaller average codeword length than that of the Lempel-Ziv 
code, thereby confirming the practicability of UDOOCs. 

I. Introduction 

The investigation of lossless source coding can be roughly 
classified into two categories, one for the compression of a 
sequence of source letters and the other for a single “one shot” 
source symbol [7]. A well-known representative for the former 
is the Huffman code, while the latter is usually referred to as 
the one-to-one code (OOC). 

The Huffman code is an optimal entropy code that can 
achieve the minimum average codeword length for a given 
statistics of source letters. It obeys the rule of unique decod- 
ability and hence the concatenation of Huffman codewords 
can be uniquely recovered by the decoder. Although optimal 
in principle, it may encounter several obstacles in implementation.
For example, the rare codewords are exceedingly long,
thereby hampering the efficiency of decoding. Other
practical obstacles include 

i) the codebook needs to be pre-stored for encoding and 
decoding, which might demand a large memory space 
for sources with moderately large alphabet size, 

ii) the decoding of a sequence of codewords must be done
sequentially, not in parallel, and

iii) erroneous decoding of one codeword could affect the
decoding of subsequent codewords, i.e., error propagation.

In contrast to unique decodability, the OOC only requires an 
assignment of distinct codewords to the source symbols. It 
has been studied since the 1970s [33] and is shown to achieve
an average codeword length smaller than the source entropy
minus a nontrivial quantity called the anti-redundancy
[29]. Various research works over the years have shown that
the anti-redundancy can be as large as the logarithm of the
source entropy [1], [5], [6], [13], [21], [24], [26], [27], [29]-[31].
In comparison with an entropy code such as the Huffman code,
the codewords of an OOC can be sequenced alphabetically and
hence the practice of an OOC is generally considered to be
more computationally convenient.

A question that may arise from the above discussion is
whether we could add a “comma” indicator, termed Unique
Word (UW) in this paper, in-between consecutive OOC codewords,
and use the OOC for the lossless compression of a
sequence of source letters. A direct merit of such a structure
is that the alphabetically sequenced OOC codewords can be
manipulated without an a priori stored codebook at both the
encoding and decoding ends. This is however achieved at a
price of an additional constraint that the “comma” indicator
must not appear as an internal subword^1 in the concatenation
of either an OOC codeword with a comma indicator, or a
comma indicator with an OOC codeword.

On the one hand, this additional constraint facilitates the 
fast identification of OOC codewords in a coded bit-stream 
and makes feasible the subsequent parallel decoding of them. 
On the other hand, the achievable average codeword length 
of a UW-forbidden OOC may increase significantly for a bad 
choice of UWs. Therefore, it is of theoretical importance to 
investigate the minimum average codeword length of a UW- 
forbidden OOC, in particular the selection of a proper UW 
that could minimize this quantity. Since the resultant UW-forbidden
OOC coding system satisfies unique decodability
(UD), we will refer to it conveniently as the UDOOC in the
sequel.

^1 We say $a = a_1 \cdots a_m$ is not an internal subword of $b = b_1 \cdots b_n$ if
there does not exist $i$ such that $b_i \cdots b_{i+m-1} = a$, for all $1 < i < n-m+1$.
When the same condition holds for all $1 \le i \le n-m+1$ (i.e., with two
equalities), we say $a$ is not a subword of $b$.

We would like to point out that the conception of inserting 
UWs between consecutive words might not be new in existing 
applications. For example, in the IEEE 802.11 standard for 
wireless local area networks [3], an entity similar to the UW 
in a bit-stream has been specified as a boundary indicator 
for a frame, or as a synchronization support, or as a part 
of error control mechanism. In written English, punctuation 
marks and spacing are essential to disambiguate the meaning 
of sentences. However, a complete theoretical study of the 
UDOOC conception remains undone. This is therefore the 
main target of this paper. We now give a formal definition 
of binary UDOOCs. 

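Definition 1: Given UW $\mathbf{k} = k_1 k_2 \cdots k_L \in \mathbb{F} \times \cdots \times \mathbb{F} = \mathbb{F}^L$,
where $\mathbb{F} = \{0,1\}$, we say $\mathcal{C}_{\mathbf{k}}(n)$ is a UDOOC of length
$n \ge 1$ associated with k if it contains all binary length-n
tuples $\mathbf{b} = b_1 \cdots b_n$ such that k is not an internal subword
of the concatenated bit-stream kbk. As a special case, we set
$\mathcal{C}_{\mathbf{k}}(0) := \{\text{null}\}$.^2 The overall UDOOC associated with k,
denoted by $\mathcal{C}_{\mathbf{k}}$, is given by

$$\mathcal{C}_{\mathbf{k}} := \bigcup_{n \ge 0} \mathcal{C}_{\mathbf{k}}(n).$$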


For a better comprehension of Definition 1, we next give 
an example to illustrate how a UDOOC is generated and how 
it is used in encoding and decoding. The way to count the 
number of length-n UDOOC codewords will follow. 

Example 1: Suppose the UW k = 00 is chosen. In order to
prohibit the concatenation of any UDOOC codeword and UW, 
regardless of the ordering, from containing 00 as an internal 
subword, the following constraints must be satisfied. 

• Type-I constraints: The UW cannot be a subword of any
UDOOC codeword. This means that within any codeword
of length $n \ge 1$:

(C1) “0” can only be followed by “1”.

(C2) “1” can be followed by either “0” or “1”.

• Type-II constraints: Besides the type-I constraints, the 
UW cannot appear as an internal subword, containing the 
boundary of any UDOOC codeword and UW, regardless 
of the ordering. This implies that except for the “null” 
codeword: 

(C3) The first bit of a codeword cannot be “0”. 

(C4) The last bit of a codeword cannot be “0”. 

By Constraints (C1)-(C4), we can place the UDOOC codewords
on a code tree as shown in Fig. 1, in which each path
starting from the root node and ending at a gray-shaded node 
corresponds to a codeword. Thus, the codewords for UW = 00 
include null, 1, 11, 101, 111, 1011, 1101, 1111, etc. It should 
be noted that we only show the codewords of length up to four, 
while the code tree actually can grow indefinitely in depth. 

At the decoding stage, suppose the received bit-stream is
00100110010100111100, where we add UWs at both the
left and the right ends to indicate the margins of the bit-stream.
This may facilitate, for example, noncoherent bit-stream
transmission. Then, the decoder first locates UWs and
parses the bit-stream into separate codewords as 1, 11, 101
and 1111, after which the four codewords can be decoded
separately (possibly in parallel) to their respective source
symbols.

^2 In our binary UDOOC, it is allowed to place two UWs side-by-side with
nothing in-between in order to produce a null codeword.
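The parsing step in the decoding example above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the decoding algorithm of Section IV: the function name `parse_stream` is hypothetical, and the stream is assumed to begin and end with the UW, as in the example.

```python
from typing import List


def parse_stream(stream: str, uw: str) -> List[str]:
    """Split a UW-delimited bit-stream into its codewords.

    Assumes (as in Example 1) that the stream starts and ends with the UW and
    that every codeword b satisfies "uw is not an internal subword of uw+b+uw",
    so the first occurrence of uw found after a codeword starts is its delimiter.
    """
    if not stream.startswith(uw):
        raise ValueError("stream must start with the UW")
    codewords = []
    pos = len(uw)  # first bit after the leading UW
    while pos < len(stream):
        nxt = stream.find(uw, pos)  # next UW marks the end of the codeword
        if nxt < 0:
            raise ValueError("missing closing UW")
        codewords.append(stream[pos:nxt])  # may be "" (null codeword)
        pos = nxt + len(uw)
    return codewords


if __name__ == "__main__":
    print(parse_stream("00100110010100111100", "00"))
```

Once the codewords are separated this way, each one can be mapped back to its source symbol independently, which is what makes parallel decoding possible.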

With the code tree representation, the number of length-n
codewords in a UDOOC code tree can be straightforwardly
calculated. Let the “null”-node be placed at level 0. For $n \ge 1$,
denote by $a_n$ and $b_n$ the numbers of “1”-nodes and “0”-nodes
at the nth level of the code tree, respectively. By the two type-I
constraints, the following recursions hold:

$$a_n = a_{n-1} + b_{n-1}, \qquad b_n = a_{n-1} \qquad \text{for } n \ge 2.$$

With the initial values of $a_1 = 1$ and $a_2 = 1$, it follows
that $\{a_n\}_{n \ge 1}$ is the renowned Fibonacci sequence [14], i.e.,
$a_n = a_{n-1} + a_{n-2}$ for $n \ge 3$. This result, together with the
two type-II constraints, implies that the number of length-n
codewords is $|\mathcal{C}_{00}(n)| = a_n$ for $n \ge 1$, which according to
the Fibonacci recursion is given by:

$$a_n = \frac{\varphi^n - \bar{\varphi}^n}{\sqrt{5}},$$

where $\varphi = \frac{1+\sqrt{5}}{2}$ is the Golden ratio and $\bar{\varphi} = \frac{1-\sqrt{5}}{2}$ is the
Galois conjugate of $\varphi$ in number field $\mathbb{Q}(\sqrt{5})$. Thus, $|\mathcal{C}_{00}(n)|$
grows exponentially in n with base $\varphi \approx 1.618$. ■


We can similarly examine the choice of UW = 01 and draw
the respective code tree in Fig. 2, where its type-I constraints
become:

(C1) “0” can only be followed by “0”.

(C2) “1” can be followed by either “0” or “1”.

and no type-II constraints are required. We then obtain

$$a_n = a_{n-1}, \qquad b_n = a_{n-1} + b_{n-1} \qquad \text{for } n \ge 2,$$

and $|\mathcal{C}_{01}(n)| = a_n + b_n = n + 1$. Although from Figs. 1 and
2, taking UW = 01 seems to provide more codewords than
taking UW = 00 at small n, the linear growth of $|\mathcal{C}_{01}(n)|$ with
respect to codeword length n suggests that such choice is not
as good as the choice of UW = 00 when n is moderately
large.
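The two counts derived above can be cross-checked by brute force directly from Definition 1. The following Python sketch is only a sanity check under stated assumptions: the helper names `is_codeword`, `count_codewords` and `fibonacci` are hypothetical, and the enumeration is exponential in n, so it is meant for small n only.

```python
from itertools import product


def is_codeword(b: str, k: str) -> bool:
    """True if k is not an internal subword of the bit-stream k + b + k."""
    s = k + b + k
    # Dropping the first and last bit leaves exactly the internal positions.
    return k not in s[1:-1]


def count_codewords(k: str, n: int) -> int:
    """Number of length-n codewords for UW k, for n >= 1 (brute force)."""
    return sum(is_codeword("".join(bits), k) for bits in product("01", repeat=n))


def fibonacci(n: int) -> int:
    """Fibonacci numbers with a_1 = a_2 = 1, as in Example 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    for n in range(1, 11):
        print(n, count_codewords("00", n), fibonacci(n), count_codewords("01", n))
```

For UW = 00 the second and third columns should coincide, and for UW = 01 the last column should read n + 1, illustrating the crossover between linear and exponential growth discussed above.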
The above two exemplified UWs point to an important fact
that the best UW, which minimizes the average codeword
length, depends on the code size required. Thus, the investigation
of the efficiency of a UW may need to consider the
transient superiority in addition to claiming the asymptotic 
winner. 

In this paper, we provide efficient encoding and decoding
algorithms for UDOOCs, and investigate their general
combinatorial properties, in particular the enumeration of the
number of codewords for any (finite) codeword length. Based
on the obtained formula for $|\mathcal{C}_{\mathbf{k}}(n)|$, i.e., the number of
length-n codewords for a given UW k, the average codeword
length of the optimal compression of a given source statistics
using UDOOC can be computed. A classification of UWs then
follows, where two types of equivalence are specified:
(exact) equivalence and asymptotic equivalence. UWs that
are equivalent in the former sense are required to yield exactly
the same minimum average codeword length for every source
statistics, while asymptotic equivalence only requires the UWs
to result in the same asymptotic growth rate as the codeword
length approaches infinity. Enumeration of the number of
asymptotically equivalent UW classes is then studied with the
help of methodologies in [17] and [25]. Furthermore, three
upper bounds on the average codeword length of UDOOCs 
are established. The first one is a general upper bound when 
only the largest probability of source symbols is given. The 
second upper bound refines the first one under the premise 
that the source entropy is additionally known. When both 
the largest and second largest probabilities of source symbols 
are present apart from the source entropy, the third upper 
bound can be used. Since these bounds are derived in terms 
of different techniques, actually none of the three bounds 
dominates the other two for all statistics. Comparison of these 
bounds for an English text with statistics from [36] and that 
with statistics from the book Alice’s Adventures in Wonderland 
will be accordingly provided. The analysis on bounds of the 
average codeword length gives rise to two asymptotic bounds 
on ultimate per-letter average codeword length, one of which 
is tight for a certain choice of source statistics and UW, and 
the other of which leads to the achievability of the ultimate 
per-letter average codeword length to the source entropy rate 
when both the source block length for compression and UW 
length tend to infinity. 

It may be of interest to note that the enumeration of the
number of codewords, i.e., $c_{\mathbf{k},n} = |\mathcal{C}_{\mathbf{k}}(n)|$, is actually obtained
indirectly via the determination of an auxiliary quantity $s_{\mathbf{k},n}$,
which is the number of words satisfying the type-I constraints
but not necessarily the type-II constraints. By utilizing the
Goulden-Jackson cluster method [16], [20], [22], [23], [32],
an explicit formula for $s_{\mathbf{k},n}$ can be established. The desired
enumeration formula for the number of length-n UDOOC
codewords is then obtained by proving that both the so-called
linear constant coefficient difference equation (LCCDE) and
the asymptotic growth rate of $s_{\mathbf{k},n}$ and $c_{\mathbf{k},n}$ are identical.
We next show based on the obtained formula that the all-zero
UW has the largest asymptotic growth rate among all
UWs of the same length, while the UW with the smallest
growth rate is 00...01. Interestingly, the all-zero UW is
often the one that yields the smallest $c_{\mathbf{k},n}$ for small n, in
contrast to UW 00...01, whose $c_{\mathbf{k},n}$ tops all other UWs
when n is small. We afterwards demonstrate by using these
two special UWs that the general encoding and decoding
algorithms can be considerably simplified when further taking
into consideration the structure of particular UWs. A side
result from the enumeration of $c_{\mathbf{k},n}$ is that for all UWs, the
codeword growth rate of UDOOCs will tend to $|\mathbb{F}| = 2$ as the
length of the UW goes to infinity.

With regard to the compression performance of the proposed 
UDOOCs, numerical results show that when grouping three 
English letters as a block and separating the consecutive blocks 
by UWs, the UDOOCs with UW = 0001, 0000, 000001 
and 000000 can respectively reach the compression rates of 
3.531, 4.089, 4.115, 4.709 bits per English letter (with the 
length of UWs included), where the source stream to be compressed
is the book titled Alice’s Adventures in Wonderland.
In comparison with the first-order Huffman code, the second-
order Huffman code, the third-order Huffman code^3 and the
Lempel-Ziv code, which respectively achieve the compression 
rates of 3.940, 3.585, 3.226 and 6.028 bits per English letter, 
the proposed UDOOCs can potentially result in comparable 
compression rate to the Huffman code under similar decoding 
complexity and yield a smaller average codeword length 
than that of the Lempel-Ziv code, thereby confirming the 
practicability of the scheme of separating OOC codewords by 
UWs. 

In the literature, there are a number of publications on 
enumeration of words in a set that forbids the appearance of a 
specific pattern [8]-[12]. For example, Doroslova investigated 
the number of binary length-n words, in which a specific 
subword like 1010 ... 10 is not allowed [10]. He then extended 
the result to non-binary alphabet and forbidden subwords 
of length 3 [9], [12], and forbidden subwords of length 4 
[11], as well as the so-called “good” forbidden subwords 
[8]. The analyses in [8]-[12] however depend on the specific 
structure of forbidden subwords considered, and no asymptotic 
examination is performed. On the other hand, algorithmic 
approaches have been devoted to a problem of similar (but not 
the same) kind, one of which is called the Goulden-Jackson 
clustering method [16], [20], [22], [23], [32]. 

Instead of enumerating the number of words internally 
without a forbidden pattern, some researchers investigate the 
inherent characteristic of such patterns. In this literature, Rivals
and Rahmann [25] provide an algorithm to account for the
number of overlaps^4 for a given set of patterns, for which the
definition will be later given in this paper for completeness
(cf. Definition 4). Different from the algorithmic approach in
[25], Guibas and Odlyzko established upper and lower bounds
for the number of overlaps when the length of the concerned
pattern goes to infinity [17].

^3 A kth-order Huffman code maps a block of k source letters onto a variable-
length codeword.

^4 In [17] and [25], the authors actually use a different name “autocorrelation”
for “overlap” originated from [16]. Specifically, they define the autocorrelation
$v = v_1 \cdots v_L$ of a binary length-L string $u = u_1 \cdots u_L$ as a binary
zero-one bit-stream of length L such that $v_i = 1$ if $i$ is a period of $u$, where $i$
is said to be a period of $u$ when $u_j = u_{i+j}$ for every $1 \le j \le L - i$. Since
the term autocorrelation is extensively used in other literature like digital
communications to illustrate a similar but different conception, we adopt the
name of “overlap” in this paper.

The rest of the paper is organized as follows. In Section 
II, construction of general UDOOCs is introduced. In Section
III, combinatorial properties of UDOOCs, including the
enumeration of the number of codewords, are derived. In 
Section IV, the encoding and decoding algorithms as well 
as bounds on average codeword length for general UDOOCs 
are provided and discussed. In Section V, numerical results 
on the compression performance of UDOOCs are presented. 
Conclusion is drawn in Section VI. 


II. Construction of UDOOCs

In the previous section, we have seen that the code tree of a
UDOOC with UW = 00 (or UW = 01) is a useful tool
for deriving its properties. Along this line, we will provide
a systematic construction of the code tree for a general UDOOC
in this section. Specifically, a digraph [2] whose directional
edges meet the type-I and type-II constraints^5 from the UW
will be first introduced. By the digraph, the construction of
a general UDOOC code tree as well as the determination of
the growth rate of UDOOC codewords with respect to the
codeword length will follow.

A. Digraphs for UDOOCs

Let $\mathbf{k} = k_1 \cdots k_L$ be the chosen UW of length L. Denote
by $G_{\mathbf{k}} = (V, E_{\mathbf{k}})$ the digraph for the UDOOC with UW = k,
where $V = \mathbb{F}^{L-1}$ is the set of all binary length-(L − 1) tuples,
and $E_{\mathbf{k}}$ is the set of directional edges given by

$$E_{\mathbf{k}} := \left\{ (\mathbf{i}, \mathbf{j}) \in V^2 : i_2^{L-1} = j_1^{L-2} \text{ and } i_1 \mathbf{j} \neq \mathbf{k} \right\}. \tag{1}$$

Here, we use the conventional shorthand $i_s^t = i_s i_{s+1} \cdots i_t$ to
denote a binary string from index s to index t, and the elements
in V are interchangeably denoted by either $\mathbf{i} = i_1 \cdots i_{L-1}$ or
$i_1^{L-1}$, depending on whichever is more convenient.

Define the $2^{L-1}$-by-$2^{L-1}$ adjacency matrix $\mathbf{A}_{\mathbf{k}}$ for the
digraph $G_{\mathbf{k}}$ by putting its $(i+1, j+1)$th entry as

$$(\mathbf{A}_{\mathbf{k}})_{i+1,j+1} = \begin{cases} 1, & \text{if } (\mathbf{i}, \mathbf{j}) \in E_{\mathbf{k}}, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where we abuse the notation by using i (resp. j) to be
the integer corresponding to binary representation of $\mathbf{i} = i_1 \cdots i_{L-1}$
(resp. $\mathbf{j} = j_1 \cdots j_{L-1}$) with the leftmost bit being
the most significant bit. As an example, for k = 010, we have
$V = \mathbb{F}^2 = \{00, 01, 10, 11\}$,

$$E_{010} = \{(00,00), (00,01), (01,11), (10,00), (10,01), (11,10), (11,11)\},$$

$G_{010} = (V, E_{010})$ in Fig. 3, and

$$\mathbf{A}_{010} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}.$$

We remark that the adjacency matrix will be used for enumerating
the number of UDOOC codewords in the next section.

^5 For clarity of its explanation, we introduce the so-called type-I and type-II
constraints in Example 1. Listing these constraints for a general UW however
may be tedious and less comprehensible. As will be seen from this section,
these constraints can actually be absorbed into the construction of the digraph
(see specifically Eq. (1)); hence, explicit listing of constraints becomes of
secondary necessity.
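The edge rule in Eq. (1) and the adjacency matrix of Eq. (2) translate almost literally into code. The sketch below is a minimal illustration, assuming UWs are given as Python strings of length L ≥ 2; the function names `build_digraph` and `adjacency_matrix` are hypothetical.

```python
from itertools import product


def build_digraph(k: str):
    """Vertices and edges of the digraph for UW k (requires len(k) >= 2).

    An edge (i, j) exists when the last L-2 bits of i equal the first L-2
    bits of j, and the length-L word i_1 followed by j differs from k.
    """
    L = len(k)
    # product() yields tuples in lexicographic order, i.e. in increasing
    # integer order with the leftmost bit most significant.
    vertices = ["".join(v) for v in product("01", repeat=L - 1)]
    edges = [(i, j) for i in vertices for j in vertices
             if i[1:] == j[:-1] and i[0] + j != k]
    return vertices, edges


def adjacency_matrix(k: str):
    """Adjacency matrix as a list of rows, indexed by the integer value of a vertex."""
    vertices, edges = build_digraph(k)
    size = len(vertices)
    A = [[0] * size for _ in range(size)]
    for i, j in edges:
        A[int(i, 2)][int(j, 2)] = 1
    return A


if __name__ == "__main__":
    for row in adjacency_matrix("010"):
        print(row)
```

For k = 010 the printed rows can be compared against the matrix given in the example of this subsection.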


B. Code Trees for UDOOCs 

Equipped with digraph $G_{\mathbf{k}}$, constructing the code tree for
the UDOOC with UW = k becomes straightforward. Recall
that a UDOOC codeword of length n is a binary n-tuple
$\mathbf{b} = b_1 \cdots b_n$, satisfying that k is not an internal subword of
the concatenated bit-stream kbk. As such, the traversal of the
digraph for constructing a UDOOC code tree should start from
the vertex $k_2^L \in V$, which corresponds to the initial “null”-node
in the code tree. Next, a “0”-node at level 1 is generated
if both $(k_2^L, j_1^{L-1}) \in E_{\mathbf{k}}$ and $j_{L-1} = 0$ are satisfied. By the
same rule, the “null”-node is followed by a “1”-node at level 1
if $(k_2^L, j_1^{L-1}) \in E_{\mathbf{k}}$ and $j_{L-1} = 1$. We then move the current
vertex to $j_1^{L-1}$ and draw a branch from the “$j_{L-1}$”-node at level 1
to a followup “0”-node (resp. “1”-node) at level 2 in the code
tree if $(j_1^{L-1}, \ell_1^{L-1}) \in E_{\mathbf{k}}$ and $\ell_{L-1} = 0$ (resp. $\ell_{L-1} = 1$).

We move the current vertex again to $\ell_1^{L-1}$ and re-do the above
procedure to generate the nodes in the next level. Repeating
this process will complete the exploration of the nodes in the
entire code tree.

Determination of the gray-shaded nodes that end a codeword
can be done as follows. Since k cannot be an internal subword
of kbk, a node should be gray-shaded if it is immediately
followed by a sequence of offspring nodes with their binary
marks equal to $k_1 \cdots k_{L-1}$. The construction of the UDOOC
code tree is accordingly finished.

As an example, we continue from the exemplified UW
k = 010 with digraph $G_{\mathbf{k}}$ in Fig. 3 and explore its respective
UDOOC code tree in Fig. 4 by following the previously
mentioned procedure. By starting from the vertex $k_2^3 = 10$
that corresponds to the “null”-node, two succeeding nodes
are generated since both (10,00) and (10,01) are in $E_{010}$
(cf. Fig. 4). Now from vertex 00 that corresponds to the “0”-
node at level 1, we can reach either vertex 00 or vertex 01
in one transition; hence, both “0”-node and “1”-node are the
succeeding nodes to the “0”-node at level 1. However, since
vertex 01 can only walk to vertex 11 in one transition, the “1”-
node at level 1 has only one succeeding node with mark “1.”
Continuing this process then exhausts all the nodes in the code
tree in Fig. 4. Next, all nodes that are followed by $k_1 k_2 = 01$
in sequence in the code tree are gray-shaded. The construction
of the code tree for the UDOOC with UW k = 010 is then
completed.
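The traversal and gray-shading procedure described in this subsection can be sketched as a level-by-level walk on the digraph. This is an illustrative sketch, not the authors' implementation: the function name `udooc_codewords` and its helpers are hypothetical, vertices are kept as strings of the last L − 1 bits read so far, and only codewords of length n ≥ 1 are listed (the null codeword is included by convention in Definition 1).

```python
def udooc_codewords(k: str, max_len: int):
    """List codewords of length 1..max_len for UW k by walking the digraph."""

    def step(v: str, bit: str):
        # Moving from vertex v along "bit" is forbidden exactly when the
        # length-L window v + bit equals the UW (edge rule of Eq. (1)).
        w = v + bit
        return None if w == k else w[1:]

    def is_shaded(v: str) -> bool:
        # A node ends a codeword if it can be followed by k_1 ... k_{L-1}.
        for bit in k[:-1]:
            v = step(v, bit)
            if v is None:
                return False
        return True

    result = []
    frontier = [("", k[1:])]  # "null"-node sits at vertex k_2 ... k_L
    for _ in range(max_len):
        next_level = []
        for word, v in frontier:
            for bit in "01":
                u = step(v, bit)
                if u is not None:
                    next_level.append((word + bit, u))
        result.extend(word for word, v in next_level if is_shaded(v))
        frontier = next_level
    return result


if __name__ == "__main__":
    print(udooc_codewords("00", 4))
    print(udooc_codewords("010", 3))
```

The first call should reproduce the non-null codewords listed in Example 1 for UW = 00; the second explores the first three levels of the code tree for UW = 010.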

We end this section by giving the type-I and type-II con¬ 
straints for the exemplified code tree as follows. 

• Type-I constraints: 

(C1) “0” can be followed by either “0” or “1”.

(C2) “1” can be followed by “0” only when the node
prior to this “1”-node is not a “0”-node.

• Type-II constraints: 

(C3) The first two bits of a UDOOC codeword 
cannot be “10.” 

(C4) The last two bits of a UDOOC codeword cannot 
be “01.” 

Note that with these constraints (in particular (C4)), one can 
also perform the node-shading step by first gray-shading all the 
nodes in the code tree, and then unshade those that end with 
“01” (in addition to the “l”-node at level 1 for this specific 
UW). Nevertheless, it may be tedious to perform the node-unshading
for a general UW. For example, when UW k =
01001, all nodes that end a codeword $b_1^n$, satisfying either
$b_{n-3} b_{n-2} b_{n-1} b_n k_1 = \mathbf{k}$ or $b_{n-2} b_{n-1} b_n k_1 k_2 = \mathbf{k}$, should be
unshaded. This confirms the superiority of constructing the 
UDOOC code tree in terms of the digraph over analyzing the 
explicit listing of constraints from the adopted UW that are 
perhaps convenient only for some special UWs. 

III. Combinatorial Properties of UDOOCs 
A. The Determination of $|\mathcal{C}_{\mathbf{k}}(n)|$

In this subsection, we will see that the conception of digraph
$G_{\mathbf{k}}$, in particular its respective adjacency matrix $\mathbf{A}_{\mathbf{k}}$, can lead
to a formula for the number of length-n codewords, i.e.,
$c_{\mathbf{k},n} = |\mathcal{C}_{\mathbf{k}}(n)|$.

In accordance with the fact that the traversal of the digraph
for constructing a UDOOC code tree should start from vertex
$k_2^L \in V$, we define a length-$2^{L-1}$ initial vector $\mathbf{x}_{\mathbf{k}}$ as $(\mathbf{x}_{\mathbf{k}})_{j+1} = 1$
if integer j has the binary representation $k_2^L$, and $(\mathbf{x}_{\mathbf{k}})_{j+1} = 0$,
otherwise, for $0 \le j < 2^{L-1}$. It then follows that the $(\ell+1)$th
entry of row vector $\mathbf{x}_{\mathbf{k}}^{\top} \mathbf{A}_{\mathbf{k}}^{n}$ gives the number of length-n walks
that end at vertex $\boldsymbol{\ell}$ on digraph $G_{\mathbf{k}}$, where $\top$ denotes the
vector/matrix transpose operation, and $\boldsymbol{\ell} = \ell_1 \cdots \ell_{L-1}$ is the
binary representation of integer index $\ell$.

Fig. 4. Code tree for the UDOOC with UW k = 010 (levels n = 0 to n = 4).

However, not every length-n walk produces a codeword.
Notably, some nodes on the code tree will be gray-shaded and
some will not. Recall that k cannot be an internal subword
of kbk if $\mathbf{b} = b_1 b_2 \cdots b_n$ is a codeword. This implies that b
is a length-n codeword if, and only if, the vertex sequence
$k_2^L,\ k_3^L b_1,\ k_4^L b_1^2,\ \ldots,\ b_n k_1^{L-2},\ k_1^{L-1}$ is a valid walk of length
n + L − 1 on digraph $G_{\mathbf{k}}$. As a result, the number of length-n
codewords equals the number of length-(n + L − 1) walks
from vertex $k_2^L$ to vertex $k_1^{L-1}$ on digraph $G_{\mathbf{k}}$. Following the
above discussion, we define the length-$2^{L-1}$ ending vector
$\mathbf{y}_{\mathbf{k}}$ as $(\mathbf{y}_{\mathbf{k}})_{j+1} = 1$ if integer j has the binary representation
$k_1^{L-1}$ and $(\mathbf{y}_{\mathbf{k}})_{j+1} = 0$, otherwise, for $0 \le j < 2^{L-1}$. Then,
the number of length-n codewords is given by

$$c_{\mathbf{k},n} := |\mathcal{C}_{\mathbf{k}}(n)| = \mathbf{x}_{\mathbf{k}}^{\top} \mathbf{A}_{\mathbf{k}}^{n+L-1} \mathbf{y}_{\mathbf{k}}. \tag{3}$$
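The walk-counting formula (3) can be checked numerically against a brute-force enumeration from Definition 1. The sketch below is an assumption-laden illustration in plain Python (no external libraries): the names `adjacency`, `count_by_walks` and `count_brute` are hypothetical, and the comparison is restricted to n ≥ 1 because the null codeword at n = 0 is included by convention in Definition 1 rather than counted as a walk.

```python
from itertools import product


def adjacency(k: str):
    """Vertex list (in integer order) and adjacency matrix of the digraph for UW k."""
    L = len(k)
    V = ["".join(v) for v in product("01", repeat=L - 1)]
    A = [[1 if (i[1:] == j[:-1] and i[0] + j != k) else 0 for j in V] for i in V]
    return V, A


def count_by_walks(k: str, n: int) -> int:
    """Evaluate x^T A^(n+L-1) y by repeated row-vector times matrix products."""
    L = len(k)
    V, A = adjacency(k)
    m = len(V)
    row = [1 if v == k[1:] else 0 for v in V]  # x: indicator of vertex k_2..k_L
    for _ in range(n + L - 1):
        row = [sum(row[i] * A[i][j] for i in range(m)) for j in range(m)]
    return row[V.index(k[:-1])]  # y: indicator of vertex k_1..k_{L-1}


def count_brute(k: str, n: int) -> int:
    """Count length-n words b with k not an internal subword of k + b + k."""
    return sum(k not in (k + "".join(b) + k)[1:-1]
               for b in product("01", repeat=n))


if __name__ == "__main__":
    for k in ["00", "01", "010", "0001"]:
        print(k, [count_by_walks(k, n) for n in range(1, 9)])
```

Repeated vector-matrix products are used instead of forming the matrix power explicitly, which keeps the cost at (n + L − 1) products of size $2^{L-1}$.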

B. Equivalence among UWs 

Two UWs that result in the same minimum average code¬ 
word length for every source statistics should be considered 
equivalent. This leads to the following definition. 

Definition 2: Two UWs k and k' are said to be equivalent,
denoted by $\mathbf{k} \equiv \mathbf{k}'$, if the numbers of their length-n codewords
in the corresponding UDOOCs are the same for all n, i.e.,

$$c_{\mathbf{k},n} = c_{\mathbf{k}',n} \quad \text{for all } n \ge 0. \tag{4}$$

By this definition, UDOOCs associated with equivalent
UWs have the same number of codewords in every code
tree level; hence they achieve the same minimum average
codeword length in the lossless compression of a sequence
of source letters. This equivalence relation allows us to focus
only on one UW in every equivalence class. It is however hard to
exhaust and identify all equivalence classes of UWs of arbitrary
length. Instead, we will introduce a less restrictive notion
of asymptotic equivalence when the asymptotic compression
rate of UDOOCs is concerned, and derive the number of all
asymptotic equivalence classes of UWs in Section III-E.

Some properties about the (exact) equivalence of UWs are 
given below. 

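Proposition 1 (Equivalence in order reversing): UW $\mathbf{k}' = k_L \cdots k_1$
is equivalent to UW $\mathbf{k} = k_1 k_2 \cdots k_L$.

Proof: It follows simply from that $\mathbf{b} = b_1 b_2 \cdots b_n \in \mathcal{C}_{\mathbf{k}}$
if, and only if, $\mathbf{b}' = b_n \cdots b_1 \in \mathcal{C}_{\mathbf{k}'}$. ■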

Proposition 2 (Equivalence in binary complement): If $\bar{\mathbf{k}}$ is
the bit-wise binary complement of k, then k and $\bar{\mathbf{k}}$ are
equivalent.

Proof: It is a consequence of the fact that the concatenated
bit-stream $\mathbf{k}\mathbf{b}\mathbf{k}$ contains k as an internal subword if, and
only if, the binary complement $\bar{\mathbf{k}}\bar{\mathbf{b}}\bar{\mathbf{k}}$ of $\mathbf{k}\mathbf{b}\mathbf{k}$ contains $\bar{\mathbf{k}}$ as
an internal subword. ■

From Propositions 1 and 2, it can be verified that there
are at most four equivalence classes for UWs of length L = 4.
Representative UWs for these four equivalence classes are 0000,
0001, 0100 and 0101, respectively.
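Propositions 1 and 2 are easy to spot-check numerically for short UWs. The following sketch compares codeword-count profiles of a UW, its reversal and its complement; it is only an empirical check for finitely many n under Definition 1, and the helper names `count_codewords`, `reverse`, `complement` and `profile` are hypothetical.

```python
from itertools import product


def count_codewords(k: str, n: int) -> int:
    """Brute-force count of length-n codewords (n >= 1) for UW k."""
    return sum(k not in (k + "".join(b) + k)[1:-1]
               for b in product("01", repeat=n))


def reverse(k: str) -> str:
    """Order-reversed UW, as in Proposition 1."""
    return k[::-1]


def complement(k: str) -> str:
    """Bit-wise complemented UW, as in Proposition 2."""
    return k.translate(str.maketrans("01", "10"))


def profile(k: str, max_n: int):
    """Codeword counts for n = 1..max_n; equal profiles are necessary for equivalence."""
    return [count_codewords(k, n) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    for k in ["0001", "0100", "01001"]:
        print(k, profile(k, 8) == profile(reverse(k), 8) == profile(complement(k), 8))
```

Matching profiles up to a finite n do not prove equivalence in the sense of Definition 2; they only fail to contradict it.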

C. Growth Rates of UDOOCs 

In this subsection, we investigate the asymptotic growth rate 
of UDOOCs, of which the definition is given below. 

Definition 3: Given UW k, the asymptotic growth rate of
the resulting UDOOC is defined as

$$g_{\mathbf{k}} := \lim_{n \to \infty} \frac{c_{\mathbf{k},n+1}}{c_{\mathbf{k},n}}. \tag{5}$$

By its definition, the asymptotic growth rate of a UDOOC 
indicates how fast the number of codewords grows as n 
increases. 

It is obvious that $g_{\mathbf{k}} \le 2$ for all UWs because the
upper bound of 2 is the growth rate for unconstrained binary
sequences of length n. In addition, the limit in (5) must exist
since it can be inferred from enumerative combinatorics [28],
and also from algebraic graph theory [4], that $g_{\mathbf{k}}$ is the largest
eigenvalue of adjacency matrix $\mathbf{A}_{\mathbf{k}}$. In the next proposition,
we show that the largest eigenvalue of the adjacency matrix is
unique for all UWs but k = 01.

Proposition 3 (Uniqueness of the largest eigenvalue of $\mathbf{A}_{\mathbf{k}}$):
For any UW k of length $L \ge 2$ except k = 01, the largest
eigenvalue of adjacency matrix $\mathbf{A}_{\mathbf{k}}$ is unique and is real.

Proof: By the Perron-Frobenius theorem [15], [19], the largest
eigenvalue of adjacency matrix $\mathbf{A}_{\mathbf{k}}$ is unique and real with
algebraic multiplicity equal to 1 if $G_{\mathbf{k}}$ is a strongly connected
digraph. Thus, we only need to show that $G_{\mathbf{k}}$ is a strongly
connected digraph except for k = 01.

We then argue that $G_{\mathbf{k}}$ is a strongly connected digraph when
$L \ge 3$ as follows. According to the definition of $E_{\mathbf{k}}$ in (1),
the only situation that a vertex may not be strongly connected
to other vertex is when $\mathbf{j} = k_2 k_3 \cdots k_L$. This however cannot
happen when $L \ge 3$ because vertex $k_1 k_2 \cdots k_{L-1}$ will connect
strongly to $k_2 k_3 \cdots k_L$. The proof is completed after verifying
the two cases for L = 2, i.e., $G_{00}$ is strongly connected but
$G_{01}$ is not. ■

Fig. 5. Digraph $G_{\mathbf{k}}$ for UW k = 01.


The digraph for k = 01 is plotted in Fig. 5. It clearly
indicates that there is no directed path from vertex 0 to vertex
1. In fact, the algebraic multiplicity of the largest eigenvalue
1 of $\mathbf{A}_{01}$ is two.
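The growth rate of Definition 3 can be approximated numerically by evaluating the ratio of consecutive codeword counts at a moderately large n. The sketch below is an illustration only, assuming exact integer arithmetic in plain Python; the names `codeword_counts` and `growth_ratio` are hypothetical, and the counts are obtained as walk counts on the digraph in the manner of (3).

```python
from itertools import product


def codeword_counts(k: str, max_n: int):
    """Return [x^T A^(n+L-1) y for n = 0..max_n] using exact integers."""
    L = len(k)
    V = ["".join(v) for v in product("01", repeat=L - 1)]
    m = len(V)
    A = [[1 if (i[1:] == j[:-1] and i[0] + j != k) else 0 for j in V] for i in V]
    # Pre-multiply the ending vector once: col = A^(L-1) y.
    col = [1 if v == k[:-1] else 0 for v in V]
    for _ in range(L - 1):
        col = [sum(A[i][j] * col[j] for j in range(m)) for i in range(m)]
    row = [1 if v == k[1:] else 0 for v in V]  # x^T
    counts = []
    for _ in range(max_n + 1):
        counts.append(sum(r * c for r, c in zip(row, col)))  # x^T A^n (A^(L-1) y)
        row = [sum(row[i] * A[i][j] for i in range(m)) for j in range(m)]
    return counts


def growth_ratio(k: str, n: int) -> float:
    """Finite-n estimate c_{k,n+1} / c_{k,n} of the asymptotic growth rate."""
    counts = codeword_counts(k, n + 1)
    return counts[n + 1] / counts[n]


if __name__ == "__main__":
    for k in ["00", "01", "0000", "0001"]:
        print(k, growth_ratio(k, 60))
```

For k = 00 the estimate should settle near the Golden ratio, while for k = 01 it decays toward 1 only at rate 1/n, consistent with the linear growth of $|\mathcal{C}_{01}(n)|$ and the repeated eigenvalue noted above.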

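By the standard technique of using an indeterminate z in
enumerative combinatorics, we can enumerate the numbers
$c_{\mathbf{k},n}$ through their generating function:

$$\begin{aligned}
\sum_{n=0}^{\infty} c_{\mathbf{k},n} z^n
&= \mathbf{x}_{\mathbf{k}}^{\top} \left( \sum_{n=0}^{\infty} \mathbf{A}_{\mathbf{k}}^{n+L-1} z^n \right) \mathbf{y}_{\mathbf{k}} \\
&= \mathbf{x}_{\mathbf{k}}^{\top} (\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)^{-1} \mathbf{A}_{\mathbf{k}}^{L-1} \mathbf{y}_{\mathbf{k}} \\
&= \frac{\mathbf{x}_{\mathbf{k}}^{\top} \, \mathrm{adj}(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z) \, \mathbf{A}_{\mathbf{k}}^{L-1} \mathbf{y}_{\mathbf{k}}}{\det(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)},
\end{aligned} \tag{6}$$

where the first equality follows from (3) and $\mathbf{I}$ denotes the
identity matrix of proper size. Equation (6) then implies that
$\det(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)$ can give a linear recursion of $c_{\mathbf{k},n}$ in the form
of a linear constant coefficient difference equation (LCCDE).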

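Now let $\lambda_1, \ldots, \lambda_m$ be distinct nonzero eigenvalues of
adjacency matrix $\mathbf{A}_{\mathbf{k}}$ with algebraic multiplicities $e_1, \ldots, e_m$,
respectively, where we assume with no loss of generality that
$|\lambda_1| \ge \cdots \ge |\lambda_m|$. In terms of the standard technique of
partial fraction for rational functions, we can rewrite (6) as

$$\sum_{n=0}^{\infty} c_{\mathbf{k},n} z^n
= \frac{\mathbf{x}_{\mathbf{k}}^{\top} \, \mathrm{adj}(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z) \, \mathbf{A}_{\mathbf{k}}^{L-1} \mathbf{y}_{\mathbf{k}}}{\det(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)}
= \sum_{i=1}^{m} \frac{p_i(z)}{(1 - \lambda_i z)^{e_i}} \tag{7}$$

for some polynomials $p_i(z)$. The next step is naturally to
rewrite the right-hand side (RHS) of (7) as a power series of
indeterminate z in order to recover the actual values of $c_{\mathbf{k},n}$
for all n. As an example, this can be done by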


$$\frac{1}{(1 - \lambda_i z)^{e_i}} = \sum_{n=0}^{\infty} \binom{n + e_i - 1}{n} \lambda_i^n z^n,$$

which holds for all $|z| < \min_{1 \le i \le m} |\lambda_i|^{-1}$.
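As a small worked instance of (6) and (7), consider the UW k = 01 from Section I, for which L = 2 and V = {0, 1}. The derivation below is a sketch added for illustration; it uses only the definitions of $\mathbf{A}_{\mathbf{k}}$, $\mathbf{x}_{\mathbf{k}}$ and $\mathbf{y}_{\mathbf{k}}$ given earlier, and in the notation of (7) it corresponds to m = 1, $\lambda_1 = 1$, $e_1 = 2$ and $p_1(z) = 1$.

```latex
% Worked instance of (6) and (7) for the UW k = 01 (L = 2, V = {0, 1}).
\[
\mathbf{A}_{01} = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}, \quad
\mathbf{x}_{01} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad
\mathbf{y}_{01} = \begin{pmatrix} 1 \\ 0 \end{pmatrix},
\]
\[
\det(\mathbf{I} - \mathbf{A}_{01} z) = (1 - z)^2, \qquad
\mathrm{adj}(\mathbf{I} - \mathbf{A}_{01} z)
  = \begin{pmatrix} 1 - z & 0 \\ z & 1 - z \end{pmatrix},
\]
\[
\sum_{n \ge 0} c_{01,n} z^n
= \frac{\mathbf{x}_{01}^{\top} \, \mathrm{adj}(\mathbf{I} - \mathbf{A}_{01} z) \,
        \mathbf{A}_{01} \mathbf{y}_{01}}{(1 - z)^2}
= \frac{1}{(1 - z)^2}
= \sum_{n \ge 0} \binom{n + 1}{n} z^n
= \sum_{n \ge 0} (n + 1) z^n .
\]
```

The resulting coefficients n + 1 agree with the count $|\mathcal{C}_{01}(n)| = n + 1$ obtained from the code tree for UW = 01.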

Although the asymptotic growth rate equals exactly the
largest eigenvalue of adjacency matrix $\mathbf{A}_{\mathbf{k}}$, it is in general
difficult to find a closed-form expression for this value without
a proper reshaping of adjacency matrix $\mathbf{A}_{\mathbf{k}}$. Another approach
is to consider the following set for $n \ge L$,

$$\mathcal{S}_{\mathbf{k}}(n) := \left\{ \mathbf{b} \in \mathbb{F}^n : \mathbf{k} \text{ is not a subword of } \mathbf{b} \right\}, \tag{8}$$

which, in a way, defines the set of distinct length-n walks on
digraph $G_{\mathbf{k}}$. Denoting $s_{\mathbf{k},n} := |\mathcal{S}_{\mathbf{k}}(n)|$ and by an argument
similar to (6), one can easily show that

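$$\sum_{n=0}^{\infty} s_{\mathbf{k},n} z^n
= \sum_{n=0}^{L-2} s_{\mathbf{k},n} z^n
+ z^{L-1} \, \frac{\mathbf{1}^{\top} \, \mathrm{adj}(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z) \, \mathbf{1}}{\det(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)}, \tag{9}$$

where $\mathbf{1}$ is the all-one column vector of appropriate length.
Equation (9) then implies that the enumeration of $s_{\mathbf{k},n}$ also
depends upon the polynomial $\det(\mathbf{I} - \mathbf{A}_{\mathbf{k}} z)$ as $c_{\mathbf{k},n}$ does.
Based on this observation, we can infer and prove that $s_{\mathbf{k},n}$
has the same asymptotic growth rate as $c_{\mathbf{k},n}$. We summarize
this important inference in the proposition below, while the
proof will be relegated to the next subsection.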


Proposition 4: For any UW fc, sequences {ck,n}'^=o 
{sk,n}'^=o have the same asymptotic growth rate, i.e.. 


9k — Qk^ 

where 

,. tik,n+l 
0 fc := hm -. 

n-^oo Sk,n 

Notably, in order to distinguish the asymptotic growth rate 
of Sk.n from that of Ck,n, a different font gk is used to denote 
the asymptotic growth rate of Sk,n- 


D. Enumeration of $s_{k,n}$

Enumerating $s_{k,n}$ turns out to be easier than enumerating $c_{k,n}$ because fewer constraints are imposed on the sequences in $\mathcal{S}_k(n)$. It can be done by an approach similar to the Goulden-Jackson clustering method [23]. Before delivering the main theorems, we define the overlap function and overlap vector of a binary stream $k$ as follows.

Definition 4: For a given $k$ of length $L$, its overlap function is defined as

$$r_k(i) := \begin{cases} 1, & \text{if } k_1^{L-i} = k_{i+1}^{L} \text{ and } 0 \le i \le L-1, \\ 0, & \text{otherwise.} \end{cases} \qquad (10)$$

Furthermore, we define its length-$L$ overlap vector $\mathbf{r}_k$ as the vector whose $j$th entry is $r_k(j-1)$ for $j = 1, \ldots, L$.
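As a concrete illustration of Definition 4, the following Python sketch computes the overlap vector of a binary string. The function name `overlap_vector` and the representation of $k$ as a Python string of '0'/'1' characters are illustrative assumptions and do not come from the paper.

```python
def overlap_vector(k: str) -> list:
    """Return [r_k(0), ..., r_k(L-1)] for a binary string k (Definition 4).

    r_k(i) = 1 exactly when the first L-i symbols of k equal its last L-i symbols.
    """
    L = len(k)
    return [1 if k[:L - i] == k[i:] else 0 for i in range(L)]


if __name__ == "__main__":
    # Print the overlap vectors of a few short unique words.
    for k in ("000", "010", "0001", "0110"):
        print(k, overlap_vector(k))
```

Note that $r_k(0) = 1$ for every $k$, since a string always overlaps fully with itself.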


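Theorem 1: For a length-$L$ UW $k$ with overlap function $r_k(i)$,

$$\sum_{n \ge 0} s_{k,n} z^n = \frac{1 + \sum_{i=1}^{L-1} r_k(i) z^i}{(1 - 2z)\left(1 + \sum_{i=1}^{L-1} r_k(i) z^i\right) + z^L}. \qquad (11)$$

Moreover, let $h_k(z)$ denote the denominator of (11), i.e.,

$$h_k(z) = (1 - 2z)\left(1 + \sum_{i=1}^{L-1} r_k(i) z^i\right) + z^L. \qquad (12)$$

Then

$$h_k(z) = \det(I - A_k z), \qquad (13)$$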

where $A_k$ is the adjacency matrix associated with digraph $G_k$.

Proof: The result (11) follows from the Goulden-Jackson clustering method [23]. For completeness, a simplified proof of this claim is provided in Appendix A. To establish the second claim, i.e., (13), we combine (9) and (11) to give

$$\frac{1 + \sum_{i=1}^{L-1} r_k(i) z^i}{(1 - 2z)\left(1 + \sum_{i=1}^{L-1} r_k(i) z^i\right) + z^L} = \frac{f(z)}{\det(I - A_k z)}$$

for some polynomial $f(z)$. Notice that the left-hand side (LHS) is an irreducible rational function in $z$. Furthermore, Proposition 10 in Appendix B shows $\deg \det(I - A_k z) = L$. These then imply that

$$\det(I - A_k z) = h_k(z)$$

and

$$f(z) = 1 + \sum_{i=1}^{L-1} r_k(i) z^i.$$

Thus (13) is established. ■


The next example illustrates how the above theorem is used toward the target result of Proposition 4.

Example 2: Consider the case of $k = 000$. Then from (10), the corresponding overlap function is

$$r_{000}(i) = \begin{cases} 1, & i = 0, 1, 2, \\ 0, & \text{otherwise.} \end{cases}$$

Substituting the above into (11), we obtain

$$\sum_{n \ge 0} s_{000,n} z^n = \frac{1 + z + z^2}{1 - z - z^2 - z^3}.$$

By regarding the above as an LCCDE, we conclude that the sequence $s_{000,n}$ satisfies the following recursion for all $n \ge 0$ (with $s_{000,n} = 0$ for $n < 0$):

$$s_{000,n} = s_{000,n-1} + s_{000,n-2} + s_{000,n-3} + \delta_n + \delta_{n-1} + \delta_{n-2},$$

where

$$\delta_n = \begin{cases} 1, & n = 0, \\ 0, & \text{otherwise} \end{cases}$$

is the Kronecker delta function.
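The enumeration in Theorem 1 can be checked numerically. The sketch below, in Python, counts the binary words of length $n$ that avoid $k$ by exhaustive search and compares the counts with the linear recursion read off from the denominator $h_k(z)$ in (12), using $s_{k,n} = 2^n$ for $n < L$ as initial values. All function names are illustrative assumptions; the brute-force search is only meant for small $n$.

```python
from itertools import product


def overlap_vector(k):
    """[r_k(0), ..., r_k(L-1)] as in Definition 4."""
    L = len(k)
    return [1 if k[:L - i] == k[i:] else 0 for i in range(L)]


def count_avoiding_brute(k, n):
    """s_{k,n}: number of length-n binary words that do not contain k."""
    return sum(1 for bits in product("01", repeat=n) if k not in "".join(bits))


def count_avoiding_recursive(k, n_max):
    """s_{k,0..n_max} from the recursion implied by the denominator h_k(z)."""
    L, r = len(k), overlap_vector(k)
    s = [2 ** n for n in range(min(L, n_max + 1))]  # no room for k when n < L
    for n in range(L, n_max + 1):
        value = 2 * s[n - 1] - s[n - L]
        for i in range(1, L):
            value += r[i] * (2 * s[n - i - 1] - s[n - i])
        s.append(value)
    return s


if __name__ == "__main__":
    print(count_avoiding_recursive("000", 8))
```

For $k = 000$ the recursion reduces to $s_{000,n} = s_{000,n-1} + s_{000,n-2} + s_{000,n-3}$ for $n \ge 3$, in agreement with Example 2.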


Equipped with Theorem 1, we are now ready to prove Proposition 4.

Proof of Proposition 4: From the proof of Theorem 1, we have seen that the enumeration of $s_{k,n}$ is given by the following irreducible rational function

$$\sum_{n=0}^{\infty} s_{k,n} z^n = \frac{1 + \sum_{i=1}^{L-1} r_k(i) z^i}{h_k(z)},$$

where $h_k(z)$ is the denominator of (11) and is given by (12). Hence, it follows from the standard partial fraction technique and Proposition 3 that

$$\mathfrak{g}_k = \max\{ |u|^{-1} : h_k(u) = 0, u \in \mathbb{C} \},$$

where $\mathbb{C}$ is the set of complex numbers. Next, noticing that the function $h_k(z)$, i.e., $\det(I - A_k z)$, also appears as the denominator of the enumeration function for $c_{k,n}$ (cf. (6)), we get

$$g_k \le \max\{ |u|^{-1} : h_k(u) = 0, u \in \mathbb{C} \},$$

since the rational function in (6) could be reducible. This shows $g_k \le \mathfrak{g}_k$.

To prove $g_k \ge \mathfrak{g}_k$ (which then implies $g_k = \mathfrak{g}_k$), it suffices to show that $c_{k,n+2} \ge s_{k,n}$ for $n \ge L$. This can be done by substantiating that for any $b \in \mathcal{S}_k(n)$, there exist a prefix bit $p$ and a suffix bit $q$, where $p, q \in \mathbb{F}$, such that $pbq \in \mathcal{C}_k(n+2)$.

Arguing by contradiction, we first assume that $k$ is an internal subword of both $kpb$ and $k\bar{p}b$, where $\bar{p} = 1 - p$. This assumption, together with $b \in \mathcal{S}_k(n)$, implies the existence of indices $1 < i < L+2$ and $1 < j < L+2$ such that

Corollary 1: For a length-$L$ UW $k$ with overlap function $r_k(i)$, let $c_{k,n}$ be the number of length-$n$ codewords in the UDOOC $\mathcal{C}_k$ defined as before. Then, for $n > L$,

$$c_{k,n} = \left[ \sum_{i=1}^{L-1} r_k(i) \left( 2c_{k,n-i-1} - c_{k,n-i} \right) \right] + 2c_{k,n-1} - c_{k,n-L}. \qquad (18)$$

Proof: To prove (18), we first note that the characteristic polynomial of $A_k$ is given by

$$\chi_{A_k}(z) = \det(zI - A_k) = z^{2^{L-1}} h_k(1/z) = z^{2^{L-1} - L} \left( z^L h_k(1/z) \right),$$

where $z^L h_k(1/z)$ is a polynomial with degree $\deg h_k(z) = \deg \det(I - A_k z) = L$.

Denote


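$$\underbrace{k_i \cdots k_L\, p\, b_1 \cdots b_{i-2}}_{=\,a} = \underbrace{k_j \cdots k_L\, \bar{p}\, b_1 \cdots b_{j-2}}_{=\,\bar{a}} = k, \qquad (14)$$

where we abuse the notations to let

$$a = \begin{cases} k_2 \cdots k_L\, p, & \text{if } i = 2, \\ k_i \cdots k_L\, p\, b_1 \cdots b_{i-2}, & \text{if } 2 < i < L+1, \\ p\, b_1 \cdots b_{i-2}, & \text{if } i = L+1, \end{cases} \qquad (15)$$

and similar notational abuse is applied to $\bar{a}$ and $j$. Assume without loss of generality that $i < j$. Then, the sums of the last $(j-1)$ bits of $a$ and $\bar{a}$ must be equal, i.e.,

$$k_{L-(j-i-1)} + \cdots + k_L + p + b_1 + \cdots + b_{i-2} = \bar{p} + b_1 + \cdots + b_{j-2}.$$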


$$m := \min\{ p > 0 : \mathrm{Nullity}(A_k^p) = 2^{L-1} - L \}, \qquad (19)$$

where $\mathrm{Nullity}(\cdot)$ indicates the dimension of the null space of the square matrix inside the parentheses. By the Cayley-Hamilton theorem [18], the polynomial

$$p_k(z) := z^m \left( z^L h_k(1/z) \right) = z^m \left( z^L - 2z^{L-1} + \sum_{i=1}^{L-1} r_k(i)\, z^{L-i-1} (z - 2) + 1 \right) \qquad (20)$$

is an annihilating polynomial for $A_k$. We remark that this polynomial need not be the minimal polynomial of $A_k$. Plugging (20) into (3) yields that for $n - 1 > \max\{m, L-1\}$, we have

Canceling out common terms on both sides gives

$$k_{L-(j-i-1)} + \cdots + k_L + p = \bar{p} + b_{i-1} + \cdots + b_{j-2}. \qquad (16)$$

Note again that $\bar{a} = k$; hence, substituting $b_{i-1+\ell}$ by $k_{L-(j-i-1)+\ell}$ for $\ell = 0, 1, \cdots, j-i-1$ in (16) gives $p = \bar{p}$, which contradicts the assumption that $\bar{p} = 1 - p$.

For the suffix bit $q$, we again assume to the contrary that there exist indices $i$ and $j$, satisfying $n + 1 - L < i < j < n + 2$, such that

$$\underbrace{b_i \cdots b_n\, q\, k_1 \cdots k_{L-n+i-2}}_{=\,d} = \underbrace{b_j \cdots b_n\, \bar{q}\, k_1 \cdots k_{L-n+j-2}}_{=\,\bar{d}} = k. \qquad (17)$$

After canceling out common terms in the respective sums of the first $(n + 2 - i)$ bits of $d$ and $\bar{d}$, we obtain

$$b_i + \cdots + b_{j-1} + q = \bar{q} + k_1 + \cdots + k_{j-i}.$$


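$$\begin{aligned} c_{k,n} &= \mathbf{x}_k^{T} A_k^{n+L-1} \mathbf{y}_k \\ &= \mathbf{x}_k^{T} A_k^{n-1-m} A_k^{m+L} \mathbf{y}_k \\ &= \mathbf{x}_k^{T} A_k^{n-1-m} A_k^{m} \left[ 2A_k^{L-1} + \sum_{i=1}^{L-1} r_k(i) \left( 2A_k^{L-i-1} - A_k^{L-i} \right) - I \right] \mathbf{y}_k \qquad (21) \\ &= \left[ \sum_{i=1}^{L-1} r_k(i) \left( 2c_{k,n-i-1} - c_{k,n-i} \right) \right] + 2c_{k,n-1} - c_{k,n-L}, \qquad (22) \end{aligned}$$

where the condition $n - 1 > \max\{m, L-1\}$ follows from i) $n - 1 - m > 0$ such that (21) holds, and ii) $n - 1 > L - 1$ such that the last term on the RHS of (22) represents $c_{k,n-L}$. Finally, since $\mathrm{rank}(A_k^p) \le L$ for $p = L - 1$ (see Proposition 10), we have $m \le L - 1$, which immediately gives $\max\{m, L-1\} = L - 1$. The proof is thus completed. ■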
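The recursion (18) can be sanity-checked by exhaustive enumeration for short UWs. The Python sketch below treats a word $b$ as a codeword when $k$ is not an internal subword of $kbk$, sets $c_{k,0} = 1$ for the null codeword as done in the text, and compares brute-force counts against the right-hand side of (18) for $n > L$. Function names and the string representation are illustrative assumptions.

```python
from itertools import product


def overlap_vector(k):
    """[r_k(0), ..., r_k(L-1)] as in Definition 4."""
    L = len(k)
    return [1 if k[:L - i] == k[i:] else 0 for i in range(L)]


def is_codeword(b, k):
    """True if k occurs in k+b+k only as the leading and trailing copy."""
    w, L = k + b + k, len(k)
    return all(w[i:i + L] != k for i in range(1, len(w) - L))


def count_codewords(k, n):
    """c_{k,n} by brute force; c_{k,0} = 1 by the convention used in the text."""
    if n == 0:
        return 1
    return sum(is_codeword("".join(bits), k) for bits in product("01", repeat=n))


def recursion_rhs(k, c, n):
    """Right-hand side of (18), given the list c of earlier counts."""
    L, r = len(k), overlap_vector(k)
    total = 2 * c[n - 1] - c[n - L]
    for i in range(1, L):
        total += r[i] * (2 * c[n - i - 1] - c[n - i])
    return total


if __name__ == "__main__":
    k = "010"
    c = [count_codewords(k, n) for n in range(11)]
    print(c, all(c[n] == recursion_rhs(k, c, n) for n in range(len(k) + 1, 11)))
```

For $k = 010$ the first counts are $1, 1, 2, 4, 7$, consistent with the seven length-4 codewords listed later in Example 3.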


Since $d = k$, the above implies $q = \bar{q}$, which again leads to a contradiction. ■

One application of the result $h_k(z) = \det(I - A_k z)$ in Theorem 1 is to obtain a recursion formula for $c_{k,n}$, i.e., an LCCDE for $c_{k,n}$. This is provided in the next corollary.

So far, we have learned that $c_{k,n}$ and $s_{k,n}$ have the same asymptotic growth rate, and that both of their enumerations depend on $\det(I - A_k z)$. Below we will use $s_{k,n}$ to determine the asymptotic growth rates corresponding to two specific UWs, $a = 0 \ldots 00$ and $b = 0 \ldots 01$. We then proceed to show that $a$ has the largest growth rate among all UWs of the same length, while the smallest growth rate results when UW $= b$.

Theorem 2: Among all UWs of the same length, the all-zero UW has the largest growth rate, while UW $0 \ldots 01$ achieves the smallest.

Proof: For notational convenience, we set $a = 0 \ldots 0$ and $b = 0 \ldots 01$. For UW $= a$, it can be verified from (10) and (11) that

$$h_a(z) = 1 - \sum_{i=1}^{L} z^i, \qquad (23)$$

and hence the sequence $\{s_{a,n}\}_{n=1}^{\infty}$ satisfies the following recursion:

$$s_{a,n} = \sum_{i=1}^{L} s_{a,n-i} \quad \text{for } n \ge L.$$

Similarly, we have $h_b(z) = 1 - 2z + z^L$, and therefore,


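$g_k = g_b$, a contradiction. Now with $1 < g_k < g_b$, the following series of inequalities leads to the desired contradiction:

$$g_b \overset{(i)}{=} g_b^{-L+2} + \cdots + g_b^{-1} + 1 \overset{(ii)}{<} g_k^{-L+2} + \cdots + g_k^{-1} + 1 \overset{(iii)}{\le} g_k,$$

where (i) follows from $h_b(z = g_b^{-1}) = 0$ and $g_b > 1$, (ii) holds because $g_b^{-1} < g_k^{-1}$, and (iii) follows from (27) together with $g_k > 1$. ■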


Using a technique similar to that in the proof of Theorem 2, we can further devise a general upper bound and a general lower bound for $g_k$ that hold for any $k$.

Theorem 3: For any UW $k$ of length $L \ge 2$, the asymptotic growth rate $g_k$ satisfies

$$2 - 2^{-(L-2)} \le g_k \le 2 - 2^{-L}. \qquad (28)$$


$$s_{b,n} = 2s_{b,n-1} - s_{b,n-L} \quad \text{for } n \ge L.$$

For a general UW $k$ of length $L$, (11) gives the following recursion for $n \ge L$:

$$s_{k,n} = \sum_{i=1}^{L-1} r_k(i) \left( 2s_{k,n-i-1} - s_{k,n-i} \right) + 2s_{k,n-1} - s_{k,n-L}. \qquad (24)$$

Note that $r_k(i) \in \{0, 1\}$ by definition, and $2s_{k,m-1} \ge s_{k,m}$ for all $m$. From (24), the following bounds hold for any UW $k$ with $n \ge L$:

$$2s_{k,n-1} - s_{k,n-L} \le s_{k,n} \le \sum_{i=1}^{L} s_{k,n-i}, \qquad (25)$$

where the lower and upper bounds are respectively obtained by replacing all $r_k(i)$ in (24) by 0 and 1. In particular, $s_{k,n}$ equals the upper bound in (25) when $k = a = 00 \cdots 0$, and the lower bound is achieved when $k$ is $b = 00 \cdots 01$. By dividing all terms in (25) by $s_{k,n-1}$ and taking $n \to \infty$, we obtain

$$2 - g_k^{-(L-1)} \le g_k \le 1 + g_k^{-1} + \cdots + g_k^{-(L-1)}. \qquad (26)$$

To prove our claim that $g_a$ is the largest and $g_b$ is the smallest among all $g_k$, we first assume to the contrary that there exists $k$ with $g_k > g_a$. Substituting this into (26) leads to the following contradiction:

$$g_k \le \sum_{i=1}^{L} g_k^{-(i-1)} \overset{(i)}{<} \sum_{i=1}^{L} g_a^{-(i-1)} \overset{(ii)}{=} g_a,$$

where (i) holds because $g_k^{-1} < g_a^{-1}$ by assumption and (ii) is valid because $g_a^{-1}$ is a zero of $h_a(z)$ given in (23).

To show that $g_b$ achieves the minimum, again assume to the contrary that there exists $k$ such that $g_k < g_b$. Note from (26) that

$$0 \le g_k - 2 + g_k^{-L+1} = (1 - g_k) \left( g_k^{-L+1} + \cdots + g_k^{-1} - 1 \right). \qquad (27)$$

Although $g_k \ge 1$ in general, we claim that in this case $g_k > 1$. For otherwise, $h_k(z = g_k^{-1} = 1) = 0$ according to (12) implies that $r_k(i) = 0$ for all $1 \le i \le L-1$; hence, $h_k(z) = h_b(z)$ and

Proof: It is straightforward to see that $s_{k,n-1} \le s_{k,n} \le 2s_{k,n-1}$ and hence $1 \le g_k \le 2$.

To prove the upper bound, we assume without loss of generality that $g_k > 1$ since the upper bound trivially holds when $g_k = 1$. We then derive

$$g_k - 1 = g_k (1 - g_k^{-1}) \overset{(i)}{\le} 1 - g_k^{-L} \overset{(ii)}{\le} 1 - 2^{-L},$$

where (i) follows from multiplying both sides of the second inequality in (26) by $(1 - g_k^{-1})$ with the fact $g_k > 1$, and (ii) holds since $g_k \le 2$.

To establish the lower bound, we use the following series of inequalities:

$$\begin{aligned} g_k (1 - g_k^{-1}) &= g_k - 1 \\ &\overset{(i)}{\ge} 1 - g_k^{-(L-1)} \\ &= (1 - g_k^{-1}) \left( 1 + g_k^{-1} + \cdots + g_k^{-(L-2)} \right) \\ &\overset{(ii)}{\ge} (1 - g_k^{-1}) \left( 1 + 2^{-1} + 2^{-2} + \cdots + 2^{-(L-2)} \right) \\ &= (1 - g_k^{-1}) \left( 2 - 2^{-(L-2)} \right), \qquad (29) \end{aligned}$$

where (i) is from the first inequality in (26), and (ii) holds because $g_k \le 2$. Equipped with (29), we next distinguish two cases to complete the proof.

1) When $L = 2$, the lower bound is trivially valid and is actually achieved by taking UW $= 01$, as $g_{01} = 1$ is the multiplicative inverse of the smallest zero of the polynomial $h_{01}(z) = 1 - 2z + z^2 = (1 - z)^2$.

2) For $L > 2$, it suffices to show $g_k > 1$. Assume to the contrary that there exists $k$ of length $L > 2$ such that $g_k = 1$. By $h_k(z = g_k^{-1} = 1) = 0$ and (12), we have $r_k(i) = 0$ for all $1 \le i \le L-1$ and hence $h_k(z) = 1 - 2z + z^L$. Since $g_k$ is the multiplicative inverse of the smallest zero of $h_k(z)$, the absolute values of all the remaining zeros of $h_k(z)$, say $\lambda_1, \ldots, \lambda_{L-1}$, must be strictly larger than 1. It then follows from the splitting of $h_k(z)$, i.e.,

$$h_k(z) = (z - 1) \prod_{i=1}^{L-1} (z - \lambda_i),$$

that the constant term of $h_k(z)$ must have absolute value $\prod_{i=1}^{L-1} |\lambda_i| > 1$, contradicting the fact that the constant term in the polynomial $h_k(z) = 1 - 2z + z^L$ is 1.

■

Theorem 3 provides concrete explicit expressions for both upper and lower bounds on $g_k$. Although the bounds are asymptotically tight and well approximate the true $g_k$ for moderately large $L$, they are not sharp in general. We can actually refine them using Theorem 2 and obtain that $g_b \le g_k \le g_a$, where from the proof of Theorem 2, we have

$$g_a = \max\{ |t| : h_a(z = t^{-1}) = 0, t \in \mathbb{C} \}$$

and

$$g_b = \max\{ |t| : h_b(z = t^{-1}) = 0, t \in \mathbb{C} \}.$$

The determination of $g_a$ and $g_b$ can be done via finding the largest $|s|$ and $|t|$, $0 \ne s, t \in \mathbb{C}$, such that $h_a(s^{-1}) = 0$ and $h_b(t^{-1}) = 0$, respectively. By noting that

$$(z - 1) \left[ z^L h_a(z^{-1}) \right] = (z - 1)(z^L - z^{L-1} - \cdots - 1) = z^{L+1} - 2z^L + 1$$

and

$$z^L h_b(z^{-1}) = z^L - 2z^{L-1} + 1,$$

we conclude the following corollary.

Corollary 2: Let $a = 0 \ldots 0$ and $b = 0 \ldots 01$ be binary streams of length $L$. Then for any $k$ of the same length as $a$ and $b$,

$$g_b \le g_k \le g_a.$$

In addition, $g_a = \alpha_{L+1}$ and $g_b = \alpha_L$, where


TABLE I
The asymptotic growth rates for UWs $a$ and $b$ and the bounds in Theorem 3 with various $L$

L                    2      3      4      5      6      7      8
$2 - 2^{-L}$         1.75   1.875  1.938  1.969  1.984  1.992  1.996
$g_a$                1.618  1.839  1.928  1.966  1.984  1.992  1.996
$g_b$                1      1.618  1.839  1.928  1.966  1.984  1.992
$2 - 2^{-(L-2)}$     1      1.5    1.75   1.875  1.938  1.969  1.984


E. Asymptotic Equivalence

After presenting the results on asymptotic growth rates, we proceed to define asymptotic equivalence for UWs and show that the number of asymptotically equivalent UW classes is upper-bounded by the number of different overlap vectors in Definition 4.

Definition 5: Two UWs $k$ and $k'$ are said to be asymptotically equivalent, denoted by $k \equiv k'$, if they have the same growth rate, i.e., $g_k = g_{k'}$.

Following the definition, we have the next proposition.

Proposition 5: Fix the length $L$ of UWs, and denote by $N_L$ the number of all possible overlap vectors of length $L$, i.e., $N_L = |\{ \mathbf{r}_k : k \in \mathbb{F}^L \}|$. Then, the number of asymptotically equivalent UW classes is upper-bounded by $N_L$.

Proof: The growth rate of $s_{k,n}$ is given by $\max\{ |t| : h_k(z = t^{-1}) = 0, t \in \mathbb{C} \}$, in which the polynomial $h_k(z)$, defined in (12), is completely determined by the respective overlap vector $\mathbf{r}_k$. As two different polynomials $h_k(z)$ and $h_{k'}(z)$, resulting respectively from two different overlap vectors $\mathbf{r}_k$ and $\mathbf{r}_{k'}$, could yield the same growth rate, the number of distinct asymptotic growth rates of $s_{k,n}$ for various $k$ must be upper-bounded by $N_L$. The proof is then completed after invoking the result from Proposition 4 that $s_{k,n}$ and $c_{k,n}$ have the same growth rate. ■
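For small $L$, the quantity $N_L$ in Proposition 5 can be obtained by exhaustive search. The Python sketch below enumerates all binary strings of length $L$ and counts the distinct overlap vectors; it is a brute-force illustration only, not the efficient algorithm of [25], and the function names are assumptions made for this example.

```python
from itertools import product


def overlap_vector(k):
    """[r_k(0), ..., r_k(L-1)] as in Definition 4."""
    L = len(k)
    return [1 if k[:L - i] == k[i:] else 0 for i in range(L)]


def num_overlap_vectors(L):
    """N_L: number of distinct overlap vectors over all binary strings of length L."""
    vectors = {tuple(overlap_vector("".join(bits))) for bits in product("01", repeat=L)}
    return len(vectors)


if __name__ == "__main__":
    print([num_overlap_vectors(L) for L in range(1, 11)])
```

The exhaustive search costs on the order of $2^L L^2$ character comparisons, so it is practical only for moderate $L$.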


$$\alpha_L := \max\{ |t| : t^L - 2t^{L-1} + 1 = 0, t \in \mathbb{C} \}.$$

In particular, we have $\alpha_L \approx 2 - 2^{-L+1}$ for large $L$.

Based on Theorem 3, the following corollary is immediate by taking $L$ to infinity.

Corollary 3: For any UW $k$ of length $L$, the asymptotic growth rate of the corresponding UDOOC approaches 2 as $L \to \infty$, i.e.,

$$\lim_{L \to \infty} g_k = 2.$$
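The extreme growth rates $g_a = \alpha_{L+1}$ and $g_b = \alpha_L$ can be evaluated numerically. The Python sketch below locates the largest real root of $t^L - 2t^{L-1} + 1$ by bisection and compares it with the bounds of Theorem 3. It assumes, as the growth-rate interpretation suggests, that the root of maximum modulus is the real root in $(1, 2)$ for $L \ge 3$; the bracketing interval $[2(L-1)/L, 2]$ comes from the location of the minimum of the polynomial on $(1, 2)$. The function name `alpha` is illustrative.

```python
def alpha(L, iterations=200):
    """Largest real root of t^L - 2 t^(L-1) + 1 in [1, 2], found by bisection."""
    if L <= 2:
        return 1.0  # t - 1 for L = 1, (t - 1)^2 for L = 2
    f = lambda t: t ** L - 2 * t ** (L - 1) + 1
    lo, hi = 2 * (L - 1) / L, 2.0  # f(lo) < 0 < f(hi) for L >= 3
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for L in range(2, 9):
        g_a, g_b = alpha(L + 1), alpha(L)
        print(L, 2 - 2.0 ** -(L - 2), round(g_b, 3), round(g_a, 3), 2 - 2.0 ** -L)
```

For $L = 2$ this gives $g_a$ equal to the golden ratio, since $t^3 - 2t^2 + 1 = (t - 1)(t^2 - t - 1)$.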


One may find the number of asymptotically equivalent UW classes by a brute-force algorithm when $L$ is small. With the help of Proposition 5, an efficient algorithm for its upper bound $N_L$ is available in [25], in which $\mathbf{r}_k$ is regarded as the (auto)correlation of a string. Values of $N_L$ for various $L$ are accordingly listed in Table II. This table shows the trend, as pointed out in [17], that $\ln N_L$ grows at the speed of $(\ln L)^2$, or specifically,

$$\frac{1}{2 \ln 2} \le \liminf_{L \to \infty} \frac{\ln N_L}{(\ln L)^2} \le \limsup_{L \to \infty} \frac{\ln N_L}{(\ln L)^2} \le \frac{1}{2 \ln(3/2)}. \qquad (30)$$


In Table I, we illustrate the asymptotic growth rates of UDOOCs for UWs $a$ and $b$ with lengths up to 8. Also shown are the bounds in Theorem 3. It is seen that for moderately large $L$, all UDOOCs have roughly the same asymptotic growth rate, and hence are about equally good in terms of compressing sources of large size. Furthermore, having $g_k \to 2$ as $L \to \infty$ means that for very large $L$, UDOOCs can have asymptotic growth rates comparable to the unconstrained OOC, whose asymptotic growth rate equals 2.


TABLE II
$N_L$ values for various $L$. It is stated in [25] that the lower asymptotic bound $1/(2 \ln(2)) \approx 0.72$ in (30) only holds for very large $L$; hence, this lower bound is not valid for $L \le 13$ in this table.

L                         1     2     3     4     5     6     7     8     9     10    11    12    13
$N_L$                     1     2     3     4     6     8     10    13    17    21    27    30    37
$\ln N_L / (\ln L)^2$     -     1.44  0.91  0.72  0.69  0.65  0.61  0.59  0.59  0.57  0.57  0.55  0.55

IV. Encoding and Decoding Algorithms of UDOOCs


In this section, the encoding and decoding algorithms of UDOOCs are presented. Also provided are upper bounds for the average codeword length of the resulting UDOOC.

Denote by $\mathcal{U} = \{u_1, u_2, \cdots, u_M\}$ the source alphabet of size $M$ to be encoded. Assume without loss of generality that $p_1 \ge p_2 \ge \cdots \ge p_M$, where $p_i$ is the probability of occurrence for source symbol $u_i$.

Then, an optimal lossless source coding scheme for UDOOCs associated with UW $k$ should assign codewords of shorter lengths to messages with higher probabilities and reserve longer codewords for less likely messages. By following this principle, the encoding mapping $\phi_k$ from $\mathcal{U}$ to $\mathcal{C}_k$ should satisfy $\ell(\phi_k(u_i)) \le \ell(\phi_k(u_j))$ whenever $i \le j$, where $\ell(\phi_k(u_i))$ denotes the length of bit stream $\phi_k(u_i)$. The coding system thus requires an ordering of the words in $\mathcal{C}_k$ according to their lengths. This can be achieved in terms of the recurrence equation for $c_{k,n}$ (for example, (22)). As such, $\phi_k(u_1)$ must be the null word, and the mapping $\phi_k$ must always form a bijection between $\{u_i : F_{k,n-1} < i \le F_{k,n}\}$ and $\mathcal{C}_k(n)$ for every integer $n \ge 1$, where

$$F_{k,n} := \begin{cases} \sum_{i=0}^{n} c_{k,i}, & \text{if } n \ge 0, \\ 0, & \text{otherwise.} \end{cases} \qquad (31)$$

This optimal assignment results in the average codeword length

$$L_k = \ell(k) + \sum_{i=1}^{M} p_i \cdot \ell(\phi_k(u_i)), \qquad (32)$$

where the first term $\ell(k)$ accounts for the insertion of UW $k$ to separate adjacent codewords.
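The length assignment implied by (31) and (32) can be sketched in a few lines of Python. The code below counts codewords by brute force (a word $b$ is taken as a codeword when $k$ is not an internal subword of $kbk$, with $c_{k,0} = 1$ for the null word), assigns to the $i$th most probable symbol the smallest $n$ with $F_{k,n} \ge i$, and evaluates $L_k$. The function names and the exhaustive counting are assumptions for illustration; a practical implementation would use the recursion for $c_{k,n}$ instead.

```python
from itertools import product


def is_codeword(b, k):
    """True if k occurs in k+b+k only as the leading and trailing copy."""
    w, L = k + b + k, len(k)
    return all(w[i:i + L] != k for i in range(1, len(w) - L))


def count_codewords(k, n):
    """c_{k,n} by brute force; c_{k,0} = 1 for the null codeword."""
    if n == 0:
        return 1
    return sum(is_codeword("".join(bits), k) for bits in product("01", repeat=n))


def optimal_lengths(k, M):
    """Codeword lengths for u_1, ..., u_M: smallest n with F_{k,n} >= i."""
    lengths, n, F = [], 0, count_codewords(k, 0)
    for i in range(1, M + 1):
        while F < i:  # advance n until the cumulative count covers index i
            n += 1
            F += count_codewords(k, n)
        lengths.append(n)
    return lengths


def average_length(k, probs):
    """L_k in (32): UW length plus the expected codeword length."""
    p = sorted(probs, reverse=True)
    return len(k) + sum(pi * li for pi, li in zip(p, optimal_lengths(k, len(p))))


if __name__ == "__main__":
    print(optimal_lengths("010", 15))
    print(average_length("010", [1 / 15] * 15))
```

For $k = 010$ the cumulative counts are $F_{k,0}, \ldots, F_{k,4} = 1, 2, 4, 8, 15$, so fifteen source letters use codeword lengths 0 through 4.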


A. Upper Bounds on Average Codeword Length of UDOOCs

The average codeword length $L_k$ is clearly a function of the source distribution and does not in general admit a closed-form formula. In order to understand the general compression performance of UDOOCs, three upper bounds on $L_k$ are established in this subsection. The first upper bound is applicable when the largest probability $p_1$ of source symbols is given. Other than $p_1$, the second upper bound additionally requires the knowledge of the source entropy. When both the largest and second largest probabilities (i.e., $p_1$ and $p_2$) of source symbols are available in addition to the source entropy, the third upper bound can be used. Note that the third upper bound holds for all UWs and requires no knowledge about $k$; therefore, one might predict that the third upper bound could be looser than the other two. Experiments using English text from Alice's Adventures in Wonderland, however, indicate that such an intuitive prediction is not always valid. Nevertheless, the second upper bound is better than the first one in most cases we have examined. Details are given below.

Proposition 6 (The first upper bound on $L_k$): For UW $k$ of length $L$, the average codeword length $L_k$ is upper-bounded as follows:

$$L_k \le L + (1 - p_1) N_k, \qquad (33)$$

where $N_k$ is the smallest integer such that $F_{k,N_k} \ge M$.

Proof: It can be derived from (32) and $\ell(\phi_k(u_1)) = 0$ that

$$\begin{aligned} L_k &= L + \sum_{i=2}^{M} p_i\, \ell(\phi_k(u_i)) \\ &\le L + \sum_{i=2}^{M} p_i\, \ell(\phi_k(u_M)) \\ &= L + (1 - p_1) N_k. \end{aligned}$$
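The first upper bound can be compared with the exact average length on a toy source. In the Python sketch below, $N_k$ is obtained as the length assigned to the least likely symbol $u_M$, which equals the smallest $n$ with $F_{k,n} \ge M$. Codewords are counted by brute force with $c_{k,0} = 1$; the function names and the four-symbol distribution are illustrative assumptions.

```python
from itertools import product


def is_codeword(b, k):
    """True if k occurs in k+b+k only as the leading and trailing copy."""
    w, L = k + b + k, len(k)
    return all(w[i:i + L] != k for i in range(1, len(w) - L))


def count_codewords(k, n):
    """c_{k,n} by brute force; c_{k,0} = 1 for the null codeword."""
    if n == 0:
        return 1
    return sum(is_codeword("".join(bits), k) for bits in product("01", repeat=n))


def optimal_lengths(k, M):
    """Codeword lengths for u_1, ..., u_M: smallest n with F_{k,n} >= i."""
    lengths, n, F = [], 0, count_codewords(k, 0)
    for i in range(1, M + 1):
        while F < i:
            n += 1
            F += count_codewords(k, n)
        lengths.append(n)
    return lengths


def average_length(k, probs):
    """Exact L_k for the length-ordered assignment."""
    p = sorted(probs, reverse=True)
    return len(k) + sum(pi * li for pi, li in zip(p, optimal_lengths(k, len(p))))


def first_upper_bound(k, probs):
    """L + (1 - p_1) N_k, with N_k the codeword length of the last symbol u_M."""
    p = sorted(probs, reverse=True)
    N_k = optimal_lengths(k, len(p))[-1]
    return len(k) + (1 - p[0]) * N_k


if __name__ == "__main__":
    probs = [0.4, 0.3, 0.2, 0.1]
    print(average_length("010", probs), first_upper_bound("010", probs))
```

With $k = 010$ and probabilities $(0.4, 0.3, 0.2, 0.1)$, the assigned lengths are $0, 1, 2, 2$, so $L_k = 3.9$ while the bound evaluates to $3 + 0.6 \cdot 2 = 4.2$.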


Proposition 7 (The second upper bound on $L_k$): Suppose $g_k > 1$. Then

$$L_k \le L + \frac{H(\mathcal{U}) + p_1 \log_2(p_1)}{\log_2(g_k)} + (1 - p_1) \left( 1 - \log_{g_k}(K_k) \right), \qquad (34)$$

where $H(\mathcal{U}) = \sum_{i=1}^{M} p_i \log_2(1/p_i)$ is the source entropy in bits, $K_k$ is a constant given by

A:fc = min{pi-"-Ufc.„._i = (35) 

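and $n_i$ is the smallest integer satisfying $F_{k,n_i} \ge i$.

Proof: From the definitions of $n_i$ and $K_k$ we have

$$K_k\, g_k^{n_i - 1} \le F_{k,n_i-1} < i \le \frac{1}{p_i},$$

where the last inequality follows from $p_i \le 1/i$ for $1 \le i \le M$, as $p_1 \ge p_2 \ge \cdots \ge p_M$. By $g_k > 1$, the above implies

$$n_i \le 1 - \log_{g_k}(K_k p_i) = 1 - \log_{g_k}(p_i) - \log_{g_k}(K_k).$$

Note that $\ell(\phi_k(u_i)) \le n_i$ by the property of the optimal lossless compression function $\phi_k$. Consequently, we have

$$\begin{aligned} L_k &= L + \sum_{i=2}^{M} p_i\, \ell(\phi_k(u_i)) \\ &\le L + \sum_{i=2}^{M} p_i n_i \\ &\le L + \sum_{i=2}^{M} p_i \left( 1 - \log_{g_k}(p_i) - \log_{g_k}(K_k) \right) \\ &= L - \sum_{i=2}^{M} p_i \log_{g_k}(p_i) + \sum_{i=2}^{M} p_i \left( 1 - \log_{g_k}(K_k) \right) \\ &= L + \frac{H(\mathcal{U}) + p_1 \log_2(p_1)}{\log_2(g_k)} + (1 - p_1) \left( 1 - \log_{g_k} K_k \right). \end{aligned}$$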


The previous two upper bounds require the computation of either $N_k$, or $g_k$ and $K_k$; hence, they are functions of UW $k$. Next we provide a simple third upper bound that holds universally for all UWs.

Proposition 8 (The third upper bound on $L_k$): For UW $k$ of length $L > 2$,

$$L_k \le L + \frac{H(\mathcal{U}) + p_1 \log_2(p_1) + p_2 \log_2(p_2)}{\log_2(2 - 2^{2-L})} + (2 - 2p_1 - p_2). \qquad (36)$$


Proof: First, we claim that

$$c_{k,n} \ge 2^{n-2} \quad \text{for } 2 \le n \le L + 1. \qquad (37)$$

This claim can be established by showing that for any binary sequence $b = b_1 \ldots b_m \in \mathbb{F}^m$, where $0 \le m = n - 2 \le L - 1$, there exist a prefix bit $p$ and a suffix bit $q$, where $p, q \in \mathbb{F}$, such that $k$ is not an internal subword of $kpbqk$. This can be done in two steps: i) there exists $q \in \mathbb{F}$ such that $k$ is not a subword of $u := bqk_1^{L-1}$, and ii) there exists $p \in \mathbb{F}$ such that $k$ is not an internal subword of $kpu$.

Because the first step trivially holds when $m = 0$, we only need to focus on the case of $m > 0$. Arguing by contradiction, we suppose that $k$ is a subword of both $bqk_1^{L-1}$ and $b\bar{q}k_1^{L-1}$, where $\bar{q} = 1 - q$. This implies the existence of indices $1 \le i < j \le m + 1$ such that

$$\underbrace{b_i \cdots b_m\, q\, k_1 \cdots k_{L-m+i-2}}_{=\,d} = \underbrace{b_j \cdots b_m\, \bar{q}\, k_1 \cdots k_{L-m+j-2}}_{=\,\bar{d}} = k,$$

where we abuse the notations to let

$$d = \begin{cases} b\, q\, k_1 \cdots k_{L-m-1}, & \text{if } i = 1, \\ b_i \cdots b_m\, q\, k_1 \cdots k_{L-m+i-2}, & \text{if } 1 < i < m + 1, \\ q\, k_1 \cdots k_{L-1}, & \text{if } i = m + 1, \end{cases}$$

and similar notational abuse is applied to $\bar{d}$ and $j$. After canceling out common terms in the respective sums of the first $(m + 2 - i)$ bits of $d$ and $\bar{d}$, we obtain

$$b_i + \cdots + b_{j-1} + q = \bar{q} + k_1 + \cdots + k_{j-i}.$$

Since $d = k$, the above then implies $q = \bar{q}$, which leads to a contradiction. The validity of the first step is verified.

After verifying $u = bqk_1^{L-1} \in \mathcal{S}_k(m + L)$, we can follow the proof of Proposition 4 to confirm the second step (see the paragraph regarding (14) and (15)). The claim in (37) is thus validated. Note that equality in (37) holds when $k$ is all-zero or all-one.

Next, we note also from the proof of Proposition 4 that $c_{k,n+2} \ge s_{k,n}$ for $n \ge L$. Since $s_{k,L} = 2^L - 1$, we immediately have $c_{k,L+2} \ge 2^L - 1$. On the other hand, we can obtain from (25) that$^6$

$$\frac{s_{k,n}}{s_{k,n-1}} \ge 2 - 2^{2-L} \quad \text{for } n \ge L. \qquad (38)$$

This concludes:

$$c_{k,n} \ge \begin{cases} 1, & \text{if } 0 \le n \le 1, \\ 2^{n-2}, & \text{if } 2 \le n \le L + 1, \\ (2 - 2^{2-L})^{n-L-2} (2^L - 1), & \text{if } n \ge L + 2, \end{cases} \qquad (39)$$

where $c_{k,0} = 1$ because $\mathcal{C}_k(0)$ contains only the null codeword, and $c_{k,1} \ge 1$ can be verified again by noting that $k$ cannot be an internal subword of both $kpk$ and $k\bar{p}k$.$^7$ The lower bound (39) then indicates that if $2^L - 1 \ge (2 - 2^{2-L})^L$ for $L \ge 2$, we immediately have the following exponential lower bound for $c_{k,n}$, i.e.,

$$c_{k,n} \ge \begin{cases} 1, & \text{if } 0 \le n \le 1, \\ (2 - 2^{2-L})^{n-2}, & \text{if } n \ge 2. \end{cases} \qquad (40)$$

The stronger claim of $2^L - 1 > (2 - 2^{2-L})^L$ for $L > 0$ simply follows from

$$2^L - 1 - (2 - 2^{2-L})^L \ge 2^L - 1 - 2^{L-1} (2 - 2^{2-L}) = 1.$$

Hence, codeword lengths of the optimal UDOOC code must satisfy:$^8$

$$\ell(\phi_k(u_i)) \le \log_{2 - 2^{2-L}}(i) + 2 \quad \text{for } i \ge 3.$$

Consequently,

$$L_k = L + \sum_{i=2}^{M} p_i\, \ell(\phi_k(u_i))$$


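$^6$ We can prove (38) by induction. Extending the definition of $\mathcal{S}_k(n)$ in (8), we obtain that $s_{k,n} = 2^n$ for $0 \le n < L$. This implies

$$\frac{s_{k,L}}{s_{k,L-1}} = \frac{2^L - 1}{2^{L-1}} = 2 - 2^{1-L} \ge 2 - 2^{2-L}$$

and

$$\frac{s_{k,m}}{s_{k,m-1}} = \frac{2^m}{2^{m-1}} = 2 \ge 2 - 2^{2-L} \quad \text{for all } 1 \le m < L.$$

Now we suppose that for some $n \ge L$ fixed, (38) is true for all $1 \le m \le n$, i.e.,

$$\frac{s_{k,m}}{s_{k,m-1}} \ge 2 - 2^{2-L} \quad \text{for all } 1 \le m \le n.$$

Then, we derive by (25) that

$$\frac{s_{k,n+1}}{s_{k,n}} \ge 2 - \frac{s_{k,n-L+1}}{s_{k,n}} \ge 2 - \frac{s_{k,n-L+1}}{s_{k,n-L+1} (2 - 2^{2-L})^{L-1}} = 2 - (2 - 2^{2-L})^{-(L-1)} \ge 2 - 2^{2-L}.$$

This completes the proof of (38).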

^If it were not true, then there exist indices i and j,^ < i < j < L+l, such 
that k = ki • • • k^pki • • • ki —2 = ''' ^Lp^i ‘ ‘ ‘ ^j- 2 \ hence, p — ki = 

p — kj with ki = kj = ki. The desired contradiction is obtained. 

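$^8$ By (31) and (40), we have that for $i \ge 3$ and $n_i = \ell(\phi_k(u_i))$,

$$i > F_{k,n_i-1} = \sum_{t=0}^{n_i-1} c_{k,t} \ge 2 + \frac{(2 - 2^{2-L})^{n_i-2} - 1}{1 - 2^{2-L}},$$

which implies $\log_{2 - 2^{2-L}} \left[ (i - 2)(1 - 2^{2-L}) + 1 \right] + 2 > n_i = \ell(\phi_k(u_i))$. Since $(i - 2)(1 - 2^{2-L}) + 1 \le i$ for $i \ge 2 - 2^{L-2}$, we obtain

$$\ell(\phi_k(u_i)) < \log_{2 - 2^{2-L}} \left[ (i - 2)(1 - 2^{2-L}) + 1 \right] + 2 \le \log_{2 - 2^{2-L}}(i) + 2.$$

$$\begin{aligned} &= L + p_2 + \sum_{i=3}^{M} p_i\, \ell(\phi_k(u_i)) \\ &\le L + p_2 + \sum_{i=3}^{M} p_i \left( \log_{2 - 2^{2-L}}(i) + 2 \right) \\ &= L + 2 - 2p_1 - p_2 + \sum_{i=3}^{M} p_i \log_{2 - 2^{2-L}}(i) \\ &\le L + 2 - 2p_1 - p_2 + \sum_{i=3}^{M} p_i \log_{2 - 2^{2-L}} \frac{1}{p_i} \qquad (41) \\ &= L + 2 - 2p_1 - p_2 + \frac{H(\mathcal{U}) + p_1 \log_2(p_1) + p_2 \log_2(p_2)}{\log_2(2 - 2^{2-L})}, \end{aligned}$$

where (41) follows from the fact that $p_1 \ge p_2 \ge \cdots \ge p_M$ implies $p_i \le 1/i$ for $1 \le i \le M$. ■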


We next study the asymptotic compression performance of UDOOCs, i.e., the situation when the source has an infinitely large alphabet. Note first that with complete knowledge of the source statistics $\{p_i : i = 1, \ldots, M\}$, the upper bound (34) in Proposition 7 can be reformulated using similar arguments as

$$L_k \le L + \frac{H(\mathcal{U}) + p_1 \log_2(p_1)}{\log_2(g_k)} + (1 - p_1) \left( 1 - \log_{g_k}(T_k) \right), \qquad (42)$$

where $T_k$ is given by

$$T_k = \min\{ g_k^{1-n_i} F_{k,n_i-1} : i = 2, \cdots, M \}, \qquad (43)$$

and $n_i$ is the smallest integer satisfying $F_{k,n_i} \ge i$. Secondly, we can further extend the above upper bound (42) to the case of grouping $t$ source symbols (with repetition) to form a new "grouped" source for UDOOC compression. The alphabet set of the new source is therefore $\mathcal{U}^t$ of size $M^t$. Let $L_{k,t}$ be the per-letter average codeword length of UDOOCs for the $t$-grouped source. Then, applying (42) to the $t$-grouped source yields the following upper bound on $L_{k,t}$:

$$L_{k,t} \le \frac{1}{t} \left( L + \frac{H(\mathcal{U}^t) + q_1 \log_2(q_1)}{\log_2(g_k)} + (1 - q_1) \left( 1 - \log_{g_k}(T_{k,t}) \right) \right), \qquad (44)$$

where

$$T_{k,t} = \min\{ g_k^{1-n_i} F_{k,n_i-1} : i = 2, \cdots, M^t \}, \qquad (45)$$

$n_i$ is the smallest integer satisfying $F_{k,n_i} \ge i$, and $q_i$ is the $i$th largest probability of the grouped source. For an independent and identically distributed (i.i.d.) source, we have $H(\mathcal{U}^t) = t H(\mathcal{U})$. Moreover, assuming $M > 1$ and $p_1 < 1$ for nontrivial i.i.d. sources, we have $q_1 = p_1^t \to 0$ as $t \to \infty$, and $T_{k,t}$ can be shown to converge to some finite positive constant

^fc,oo ■— lim p 

t—^OO 


x^adj (I - Afcz) k 


k 'Vk 


det (l — Afcz) 


(1 - gkz) 


where the last step follows from the conventional expansion 
theory for power series and also from the fact of gp, being the 
unique maximal eigenvalue of the adjacency matrix A^ under 
L > 2 (cf. Proposition 3). To elaborate, from the power series 
expansion, we have that = J^i where c is some 

constant, {Ai} is the set of nonzero distinct eigenvalues of A^, 
and ai „ is the coefficient associated with Xp (which could a 
polynomial function of n if Ai has algebraic multiplicity larger 
than one). In particular, assuming Ai is the largest eigenvalue, 
we can establish that oo = oi.n = cii, where the second 
equality emphasizes that ai „ is a constant independent of n 
since Aj"^ = ^ is a simple zero for hp^{z) when the digraph 

Gfc is strongly connected. 

By taking limits (letting $t \to \infty$) on both sides of (44) and by noting that $\lim_{t \to \infty} q_1 = \lim_{t \to \infty} p_1^t = 0$ and $T_{k,\infty}$ is some finite positive constant, we summarize the asymptotic compression performance of UDOOCs in the next proposition.

Proposition 9: Given $g_k > 1$ and a nontrivial i.i.d. source, we have

$$\lim_{t \to \infty} L_{k,t} \le \frac{H(\mathcal{U})}{\log_2(g_k)} \le \frac{H(\mathcal{U})}{\log_2(2 - 2^{2-L})}. \qquad (46)$$

Two remarks are made based on Proposition 9. First, the larger asymptotic bound $H(\mathcal{U}) / \log_2(2 - 2^{2-L})$ in (46) immediately gives

$$\lim_{L \to \infty} \lim_{t \to \infty} L_{k,t} = H(\mathcal{U}).$$

Hence, if both $t$ and $L$ are sufficiently large, the per-letter average codeword length of UDOOCs can achieve the entropy rate $H(\mathcal{U})$ of the i.i.d. source. Secondly, the bound of $H(\mathcal{U}) / \log_2(g_k)$ in (46) is actually achievable by taking the all-zero UW with the source being uniformly distributed. In other words,

$$\lim_{t \to \infty} L_{a,t} = \frac{H(\mathcal{U})}{\log_2(g_a)} = \frac{\log_2(M)}{\log_2(g_a)}, \qquad (47)$$

where $a = 0 \ldots 0$. For better readability, we relegate the proof of (47) to Appendix D.

Tables III and IV evaluate the bounds for the English text source with letter probabilities from [36] and a true text source from Alice's Adventures in Wonderland with empirical frequencies directly obtained from the book, respectively. The source alphabet in both cases is of size 27, where upper- and lower-case letters are regarded as the same and all symbols other than the 26 English letters are treated as one. It can be observed from Table III that bound (33) is always the best among all three bounds but still has a visible gap to the resultant average codeword length $L_k$. Table IV, however, shows that the three bounds may take turns being the lowest of the three. For example, under $k = a$, (33), (34) and (36) are the lowest when $(L, t) = (3, 1)$, $(L, t) = (5, 2)$ and $(L, t) = (6, 3)$, respectively. Table IV also indicates that enlarging the value of $t$ may help improve the per-letter average codeword length as well as the bounds of UDOOCs. Comparison of the per-letter average codeword length of UDOOCs with the source entropy will be provided later in the simulation section.

TABLE III
Upper bounds (33), (34) and (36) on the average codeword length $L_k$ of UDOOCs for English text source with letter probabilities from [36]. Here, $a = 0 \cdots 0$ and $b = 0 \cdots 01$.

k    Quantity    L = 3     L = 4     L = 5     L = 6
a    $L_a$       6.432     7.411     8.411     9.411
a    (33)        8.330     9.330     10.330    11.330
a    (34)        9.606     10.496    11.488    12.484
b    $L_b$       5.215     6.185     7.185     8.185
b    (33)        6.553     7.553     8.553     9.553
b    (34)        10.385    10.206    10.889    11.769
-    (36)        10.831    10.140    10.652    11.456


TABLE IV
Upper bounds (33), (34) and (36) on the per-letter average codeword length $L_{k,t}$ of UDOOCs for English text source from Alice's Adventures in Wonderland. Here, $a = 0 \cdots 0$ and $b = 0 \cdots 01$.

k    Quantity     t        L = 3     L = 4     L = 5     L = 6
a    $L_{a,t}$    t = 1    5.773     6.757     7.757     7.757
a    $L_{a,t}$    t = 2    4.498     4.920     5.397     5.891
a    $L_{a,t}$    t = 3    3.862     4.089     4.388     4.709
a    (33)         t = 1    7.459     8.459     9.459     10.459
a    (33)         t = 2    6.569     7.069     7.569     7.608
a    (33)         t = 3    5.770     5.786     6.119     6.134
a    (34)         t = 1    8.700     9.596     10.585    11.580
a    (34)         t = 2    6.548     6.886     7.333     7.813
a    (34)         t = 3    5.771     5.586     6.120     6.135
b    $L_{b,t}$    t = 1    4.792     5.774     6.774     7.774
b    $L_{b,t}$    t = 2    3.791     4.134     4.598     5.090
b    $L_{b,t}$    t = 3    3.455     3.532     3.802     4.115
b    (33)         t = 1    6.716     6.973     7.973     8.973
b    (33)         t = 2    6.108     6.147     6.647     7.147
b    (33)         t = 3    5.452     5.150     5.483     5.816
b    (34)         t = 1    9.399     9.366     10.089    10.984
b    (34)         t = 2    7.356     6.819     7.040     7.435
b    (34)         t = 3    5.453     5.150     5.483     5.817
-    (36)         t = 1    9.676     9.221     9.801     10.632
-    (36)         t = 2    8.106     7.035     7.399     7.816
-    (36)         t = 3    6.947     5.815     5.726     5.889


B. General Encoding and Decoding Mappings for UDOOCs

In this subsection, the encoding and decoding mappings for a UDOOC with general UW are introduced.

The practice of UDOOC requires the encoding function $\phi_k$ to be a bijective mapping between the subset of source letters $\mathcal{U}_k(n) := \{ u_m : F_{k,n-1} < m \le F_{k,n} \}$ and the set of length-$n$ codewords $\mathcal{C}_k(n)$ for all $n$. Since the resulting average codeword length will be the same for any such bijective mapping from $\mathcal{U}_k(n)$ to $\mathcal{C}_k(n)$, we are free to devise one that facilitates efficient encoding and decoding of message $u_m$. The bijective encoding mapping $\phi_k$ that we propose is described in the following.

We define for any binary stream $d$ of length $\le n$,

$$\mathcal{C}_k(d, n) := \{ c \in \mathcal{C}_k(n) : d \text{ is a prefix of } c, \text{ or } c = d \}. \qquad (48)$$

Obviously, $\mathcal{C}_k(d, n) \cap \mathcal{C}_k(\tilde{d}, n) = \emptyset$ for every pair of distinct $d$ and $\tilde{d}$ of the same length, and for any fixed $\ell$ with $1 \le \ell \le n$,

$$\mathcal{C}_k(n) = \bigcup_{d \in \mathbb{F}^{\ell}} \mathcal{C}_k(d, n). \qquad (49)$$


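Then, given message $u_m \in \mathcal{U}_k(n)$, i.e., the number $n$ is chosen such that $F_{k,n-1} < m \le F_{k,n}$, the proposed encoding mapping $\phi_k$ produces the codeword $\phi_k(u_m) = c_1 c_2 \cdots c_n$ for source letter $u_m$ recursively according to the rule that for $i = 1, 2, \ldots, n$,

$$c_i = \begin{cases} 0, & \text{if } \rho_{i-1} \le |\mathcal{C}_k(c_1 \cdots c_{i-1} 0, n)|, \\ 1, & \text{if } \rho_{i-1} > |\mathcal{C}_k(c_1 \cdots c_{i-1} 0, n)|, \end{cases} \qquad (50)$$

where the progressive metric $\rho_i$ is also maintained recursively as

$$\rho_i := \rho_{i-1} - c_i \cdot |\mathcal{C}_k(c_1 \cdots c_{i-1} 0, n)| = \begin{cases} \rho_{i-1}, & \text{if } c_i = 0, \\ \rho_{i-1} - |\mathcal{C}_k(c_1 \cdots c_{i-1} 0, n)|, & \text{if } c_i = 1, \end{cases} \qquad (51)$$

with an initial value $\rho_0 = m - F_{k,n-1}$. This encoding mapping actually assigns codewords according to their lexicographical ordering.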

Example 3: Taking $k = 010$ as an example, we can see from Fig. 4 that the seven codewords of length 4, i.e., 0000, 0011, 0110, 0111, 1100, 1110 and 1111, will be respectively assigned to source letters $u_9$, $u_{10}$, $u_{11}$, $u_{12}$, $u_{13}$, $u_{14}$ and $u_{15}$. The progressive metrics $\rho_0, \rho_1, \rho_2, \rho_3$ for source letter $u_{11}$ are 3, 3, 1, 1, respectively, with $|\mathcal{C}_k(0,4)| = 4$, $|\mathcal{C}_k(00,4)| = 2$, $|\mathcal{C}_k(010,4)| = 0$ and $|\mathcal{C}_k(0110,4)| = 1$. ■
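The rule in (50) and (51) can be traced with a small brute-force sketch. It is only an illustration under stated assumptions: a string $c$ is taken to be a codeword when $k$ occurs in $kck$ only as the leading and trailing copy (with the null codeword always allowed), and the helper names `is_codeword`, `codewords`, `prefix_count` and `encode` are hypothetical, not taken from the paper. The prefix counts are obtained by exhaustive enumeration rather than by any of the efficient methods discussed later.

```python
from itertools import product

def is_codeword(c, k):
    """True if k occurs in k+c+k only as the leading and trailing copy."""
    if c == "":
        return True  # assumption: the null codeword is always allowed (c_{k,0} = 1)
    s = k + c + k
    return s.find(k, 1) == len(s) - len(k)

def codewords(k, n):
    """All length-n codewords for UW k, in lexicographical order."""
    return [c for c in ("".join(p) for p in product("01", repeat=n))
            if is_codeword(c, k)]

def prefix_count(k, d, n):
    """|C_k(d, n)|: number of length-n codewords having d as a prefix."""
    return sum(1 for c in codewords(k, n) if c.startswith(d))

def encode(m, k):
    """Map index m >= 1 to a codeword with the progressive metric of (50)-(51)."""
    n, F_prev, F = 0, 0, 1            # F_{k,-1} = 0 and F_{k,0} = 1
    while F < m:                      # smallest n with F_{k,n} >= m
        n += 1
        F_prev, F = F, F + len(codewords(k, n))
    rho, c = m - F_prev, ""
    for _ in range(n):
        size0 = prefix_count(k, c + "0", n)
        if rho <= size0:
            c += "0"
        else:
            c += "1"
            rho -= size0
    return c

print(codewords("010", 4))
print([encode(m, "010") for m in range(9, 16)])
```

Both printed lists should coincide with the seven codewords of Example 3, since the mapping assigns codewords of a given length in lexicographical order.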

Note again that given $m$ (equivalently, $u_m$), $n$ can be determined via $F_{k,n-1} < m \le F_{k,n}$. At the end of the $n$th recursion, we must have

$$m = F_{k,n-1} + \sum_{i=1}^{n} c_i\,|\mathcal{C}_k(c_1^{i-1}0, n)| + 1, \tag{52}$$

where $c_1^{i-1}$ denotes $c_1 \cdots c_{i-1}$. We emphasize that (52) actually gives the corresponding computation-based decoding function $\psi_k : \mathcal{C}_k(n) \to \mathcal{U}_k(n)$ for codewords $c$ of length $n$.

One straightforward way to implement $\phi_k$ and $\psi_k$ is to pre-store the value of $|\mathcal{C}_k(d0,n)|$ for every $d$ and $n$. Considering the huge number of all possible prefixes $d$ for each $n$, this straightforward approach does not seem to be an attractive one.

Alternatively, we find that $|\mathcal{C}_k(d0,n)|$ can be obtained through the adjacency matrix introduced in Section III. The advantage of this alternative approach is that there is no need to pre-store or pre-construct any part of the codebook $\mathcal{C}_k$, and the value of $|\mathcal{C}_k(d0,n)|$ is computed only when it is required during the encoding or decoding processes. Moreover, for specific UWs such as $00\ldots0$, $00\ldots01$, and their binary complements, we can further reduce the required computations.

In the following subsections, we will first introduce the encoding and decoding algorithms for specific UWs as they can be straightforwardly understood. Algorithms for general UWs require an additional computation of $|\mathcal{C}_k(d0,n)|$ and will be presented in subsequent subsections.


C. Encoding and Decoding Algorithms for UW = 11...1

It has been inferred from Proposition 2 that the encoding and decoding of the UDOOC with UW $k = 00\ldots0$ can be equivalently done through the encoding and decoding of the UDOOC with UW $k = 11\ldots1$, as one can be obtained from the other by binary complementing. Thus, we only focus on the case of $k = 11\ldots1$ in this subsection.

For this specific UW, we observe that a codeword $c = d0b \in \mathcal{C}_k(d0,n)$ if, and only if, $0b \in \mathcal{C}_k(n - \ell(d))$ is a codeword of length $n - \ell(d)$, where $\ell(d)$ is the length of prefix bitstream $d$. We thus obtain

$$|\mathcal{C}_k(d0,n)| = c_{k,n-\ell(d)}. \tag{53}$$

It can be shown that the LCCDE for $c_{k,n}$ with $k = 11\ldots1$ is

$$c_{k,n} = \sum_{i=1}^{L} c_{k,n-i} \quad \text{for all } n > L, \tag{54}$$

where the initial values are

$$c_{k,n} = \begin{cases} 1, & n = 0, 1, 2 \\ 2^{n-2}, & n = 3, \ldots, L. \end{cases} \tag{55}$$

Based on (53), (54) and (55), the algorithmic encoding and decoding procedures can be described below.


Algorithm 1 Encoding of UDOOC with $k = 11\ldots1$
Input: Index $m$ for message $u_m$
Output: Codeword $\phi_k(u_m) = c_1 \ldots c_n$
1: Compute $c_{k,0}, c_{k,1}, c_{k,2}, \ldots$ using (54) and (55) to determine the smallest $n$ such that $F_{k,n} \ge m$. If $n = 0$, then $\phi_k(u_m) = \text{null}$ and stop the algorithm.
2: Initialize $\rho_0 \leftarrow m - F_{k,n-1}$
3: for $i = 1$ to $n$ do
4:     if $\rho_{i-1} \le c_{k,n-i+1}$ then
5:         $c_i \leftarrow 0$ and $\rho_i \leftarrow \rho_{i-1}$
6:     else
7:         $c_i \leftarrow 1$ and $\rho_i \leftarrow \rho_{i-1} - c_{k,n-i+1}$
8:     end if
9: end for


Algorithm 2 Decoding of UDOOC with $k = 11\ldots1$
Input: Codeword $c = c_1 \ldots c_n$
Output: Index $m$ for message $u_m = \psi_k(c)$
1: Compute $c_{k,0}, c_{k,1}, \ldots, c_{k,n}$ using (54) and (55)
2: Initialize $m \leftarrow F_{k,n-1} + 1$
3: for $i = 1$ to $n$ do
4:     if $c_i = 1$ then
5:         $m \leftarrow m + c_{k,n-i+1}$
6:     end if
7: end for
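Algorithms 1 and 2 translate almost line by line into code. The following is a minimal sketch, not a reference implementation: the function names are hypothetical, codewords are represented as Python strings, $F_{k,n}$ is taken to be the running sum $\sum_{j \le n} c_{k,j}$ (an assumption consistent with $F_{k,0} = c_{k,0} = 1$), and `brute_force_codewords` is included only so that the result can be compared against exhaustive enumeration.

```python
from itertools import product

def counts(L, n_max):
    """c[n] = c_{k,n} for k = 1^L, from the initial values (55) and recursion (54)."""
    c = []
    for n in range(n_max + 1):
        if n <= 2:
            c.append(1)
        elif n <= L:
            c.append(2 ** (n - 2))
        else:
            c.append(sum(c[n - i] for i in range(1, L + 1)))
    return c

def encode(m, L):
    """Algorithm 1: index m >= 1 -> codeword (as a string of '0'/'1')."""
    n, c, F_prev, F = 0, counts(L, 0), 0, 1
    while F < m:                       # smallest n with F_{k,n} >= m
        n += 1
        c = counts(L, n)
        F_prev, F = F, F + c[n]
    rho, bits = m - F_prev, []
    for i in range(1, n + 1):
        size0 = c[n - i + 1]           # |C_k(c_1...c_{i-1}0, n)| by (53)
        if rho <= size0:
            bits.append("0")
        else:
            bits.append("1")
            rho -= size0
    return "".join(bits)

def decode(code, L):
    """Algorithm 2: codeword string -> index m."""
    n = len(code)
    c = counts(L, n)
    m = sum(c[:n]) + 1                 # F_{k,n-1} + 1
    for i, bit in enumerate(code, start=1):
        if bit == "1":
            m += c[n - i + 1]
    return m

def brute_force_codewords(L, n):
    """Length-n strings c such that 1^L occurs in k+c+k only at both ends."""
    k, out = "1" * L, []
    for p in product("01", repeat=n):
        c = "".join(p)
        s = k + c + k
        if c == "" or s.find(k, 1) == len(s) - L:
            out.append(c)
    return out

print([encode(m, 3) for m in range(1, 10)])
```

Because both routines only add or subtract entries of the list `c`, no part of the codebook is ever stored, which is the point made in the text.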


D. Encoding and Decoding Algorithms for UW = 11...10

Again, Proposition 2 infers that the encoding and decoding of the UDOOC with UW $k = 00\ldots01$ can be equivalently done through the encoding and decoding of the UDOOC with UW $k = 11\ldots10$. We simply take $k = 11\ldots10$ for illustration.


It can be derived from (12) that for $k = 11\ldots10$,

$$c_{k,n} = 2c_{k,n-1} - c_{k,n-L} \quad \text{for all } n > L, \tag{56}$$

with initial condition

$$c_{k,n} = \begin{cases} 1, & n = 0 \\ 2^{n}, & n = 1, \ldots, L-1 \\ 2^{L} - 1, & n = L. \end{cases} \tag{57}$$

It remains to determine $|\mathcal{C}_k(d0,n)|$. Observe that $d0b \in \mathcal{C}_k(d0,n)$ if, and only if, $b \in \mathcal{C}_k(n - \ell(d) - 1)$; hence, $|\mathcal{C}_k(d0,n)| = c_{k,n-\ell(d)-1}$. We summarize the encoding and decoding algorithms of UDOOCs with $k = 11\ldots10$ in Algorithms 3 and 4, respectively.


Algorithm 3 Encoding of UDOOC with $k = 11\ldots10$
Input: Index $m$ for message $u_m$
Output: Codeword $\phi_k(u_m) = c_1 \ldots c_n$
1: Compute $c_{k,0}, c_{k,1}, c_{k,2}, \ldots$ using (56) and (57) to determine the smallest $n$ such that $F_{k,n} \ge m$. If $n = 0$, then $\phi_k(u_m) = \text{null}$ and stop the algorithm.
2: Initialize $\rho_0 \leftarrow m - F_{k,n-1}$
3: for $i = 1$ to $n$ do
4:     if $\rho_{i-1} \le c_{k,n-i}$ then
5:         $c_i \leftarrow 0$ and $\rho_i \leftarrow \rho_{i-1}$
6:     else
7:         $c_i \leftarrow 1$ and $\rho_i \leftarrow \rho_{i-1} - c_{k,n-i}$
8:     end if
9: end for


Algorithm 4 Decoding of UDOOC with $k = 11\ldots10$
Input: Codeword $c = c_1 \ldots c_n$
Output: Index $m$ for message $u_m = \psi_k(c)$
1: Compute $c_{k,0}, c_{k,1}, \ldots, c_{k,n}$ using (56) and (57)
2: Initialize $m \leftarrow F_{k,n-1} + 1$
3: for $i = 1$ to $n$ do
4:     if $c_i = 1$ then
5:         $m \leftarrow m + c_{k,n-i}$
6:     end if
7: end for
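The codeword counts used in this subsection can be sanity-checked numerically. The sketch below assumes that the recursion (56) reads $c_{k,n} = 2c_{k,n-1} - c_{k,n-L}$ for $n > L$ with the initial condition (57), and compares it against a brute-force count that takes $c$ to be a codeword when $k$ occurs in $kck$ only at both ends. The function names are hypothetical and only the counts, not Algorithms 3 and 4 themselves, are exercised.

```python
from itertools import product

def counts(L, n_max):
    """c[n] = c_{k,n} for k = 1^(L-1) 0, using (57) and the assumed form of (56)."""
    c = []
    for n in range(n_max + 1):
        if n == 0:
            c.append(1)
        elif n < L:
            c.append(2 ** n)
        elif n == L:
            c.append(2 ** L - 1)
        else:
            c.append(2 * c[n - 1] - c[n - L])
    return c

def brute_force_count(L, n):
    """Count length-n strings c for which k appears in k+c+k only at both ends."""
    k = "1" * (L - 1) + "0"
    total = 0
    for p in product("01", repeat=n):
        s = k + "".join(p) + k
        if s.find(k, 1) == len(s) - L:
            total += 1
    return total

for L in (2, 3, 4):
    print(L, counts(L, 8), [brute_force_count(L, n) for n in range(9)])
```

For $L = 2$ the recursion gives $c_{k,n} = n + 1$, the linear growth that is discussed for the length-2 UW in Section V.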


E. Encoding and Decoding Algorithms for General UW k

It is clear from the discussions in the previous two subsections as well as from (50) that to determine $c_i$ in the encoding algorithm, we only need to keep track of the most recent $\rho_{i-1}$, instead of retaining sequentially all of $\rho_0, \ldots, \rho_{i-2}$. We address the recursion for the update of $\rho_i$ in (51) only to facilitate our interpretation on the operation of the progressive metric. The same approach will be followed in the presentation of the general encoding algorithm below, where a progressive matrix $\mathbf{D}_i$ is used in addition to the progressive metric $\rho_i$.

The encoding algorithm for general UWs consists of two phases. Given the index $m$, we first identify the smallest $n$ such that $F_{k,n} \ge m$. Note that the computation of $F_{k,n}$ requires the knowledge of $c_{k,n}$, which can be recursively obtained using the LCCDE in (18). In the second phase, as seen from the two previous subsections, we need to determine the cardinality of $\mathcal{C}_k(d,n)$ for any prefix $d$ with $\ell(d) \le n$. Thus, our target in this subsection is to provide an expression for $|\mathcal{C}_k(d,n)|$ that holds for general $k$ and $d$.

Define $E_{k,0}$ and $E_{k,1}$ for digraph $G_k = (V, E_k)$ as

$$E_{k,0} := \{(\mathbf{i},\mathbf{j}) \in E_k : j_{L-1} = 0\} \tag{58}$$

$$E_{k,1} := \{(\mathbf{i},\mathbf{j}) \in E_k : j_{L-1} = 1\}. \tag{59}$$

Literally speaking, $E_{k,0}$ (resp. $E_{k,1}$) is the set of edges in $E_k$ whose ending vertex has its last bit $j_{L-1}$ equal to 0 (resp. 1). Let $\mathbf{A}_{k,0}$ and $\mathbf{A}_{k,1}$ be the adjacency matrices respectively for digraphs $G_{k,0} = (V, E_{k,0})$ and $G_{k,1} = (V, E_{k,1})$. Obviously, $\mathbf{A}_k = \mathbf{A}_{k,0} + \mathbf{A}_{k,1}$. Based on the two adjacency matrices, we derive

$$|\mathcal{C}_k(d,n)| = \mathbf{x}_k^{T}\,\mathbf{D}_{\ell(d)}\,\mathbf{A}_k^{\,n-\ell(d)+L-1}\,\mathbf{y}_k \tag{60}$$

where for a prefix stream $d = d_1 \ldots d_i$,

$$\mathbf{D}_i := \prod_{t=1}^{i} \mathbf{A}_{k,d_t} \tag{61}$$

and $\mathbf{x}_k$ and $\mathbf{y}_k$ are the initial and ending vectors for digraph $G_k$ defined in Section III-A. With (60) and (61), the general encoding and decoding algorithms are given in Algorithms 5 and 6, respectively. Verification of the two algorithms is relegated to Appendix C for better readability.


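Algorithm 5 Encoding of UDOOC with General $k$
Input: Index $m$ for message $u_m$
Output: Codeword $\phi_k(u_m) = c_1 \ldots c_n$
1: Compute $c_{k,0}, c_{k,1}, c_{k,2}, \ldots$ using (18) and the method in Section IV-F to determine the smallest $n$ such that $F_{k,n} \ge m$. If $n = 0$, then $\phi_k(u_m) = \text{null}$ and stop the algorithm.
2: Initialize $\rho_0 \leftarrow m - F_{k,n-1}$ and $\mathbf{D}_0 \leftarrow \mathbf{I}$
3: for $i = 1$ to $n$ do
4:     Compute $\text{dummy} \leftarrow \mathbf{x}_k^{T}\mathbf{D}_{i-1}\mathbf{A}_{k,0}\mathbf{A}_k^{\,n-i+L-1}\mathbf{y}_k$
5:     if $\rho_{i-1} \le \text{dummy}$ then
6:         $c_i \leftarrow 0$, $\rho_i \leftarrow \rho_{i-1}$ and $\mathbf{D}_i \leftarrow \mathbf{D}_{i-1}\mathbf{A}_{k,0}$
7:     else
8:         $c_i \leftarrow 1$, $\rho_i \leftarrow \rho_{i-1} - \text{dummy}$ and $\mathbf{D}_i \leftarrow \mathbf{D}_{i-1}\mathbf{A}_{k,1}$
9:     end if
10: end for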


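Algorithm 6 Decoding of UDOOC with General $k$
Input: Codeword $c = c_1 \ldots c_n$
Output: Index $m$ for message $u_m = \psi_k(c)$
1: Compute $c_{k,0}, c_{k,1}, \ldots, c_{k,n}$ using (18) and the method in Section IV-F.
2: Initialize $m \leftarrow F_{k,n-1} + 1$ and $\mathbf{D}_0 \leftarrow \mathbf{I}$
3: for $i = 1$ to $n$ do
4:     if $c_i = 1$ then
5:         $m \leftarrow m + \mathbf{x}_k^{T}\mathbf{D}_{i-1}\mathbf{A}_{k,0}\mathbf{A}_k^{\,n-i+L-1}\mathbf{y}_k$
6:     end if
7:     $\mathbf{D}_i \leftarrow \mathbf{D}_{i-1}\mathbf{A}_{k,c_i}$
8: end for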
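A compact sketch of Algorithms 5 and 6 is given below, under several assumptions that are not spelled out in this section: vertices of $G_k$ are encoded as integers whose binary expansion is the last $L-1$ bits read, the only edge removed from the digraph is the one from $k_1 \cdots k_{L-1}$ to $k_2 \cdots k_L$, the UW length satisfies $L \ge 2$, and $c_{k,0} = 1$ by convention. Instead of multiplying matrices, the code exploits that $\mathbf{x}_k^{T}\mathbf{D}_{i-1}$ has a single nonzero entry (tracked as the vertex `v`) and tabulates the column vectors $\mathbf{A}_k^{t}\mathbf{y}_k$ in `walks_to_end`. All function names are hypothetical.

```python
def step(v, b, k):
    """Vertex reached from v by appending bit b, or None for the removed edge."""
    L = len(k)
    if v == int(k[:-1], 2) and b == int(k[-1]):
        return None
    return ((v << 1) | b) & ((1 << (L - 1)) - 1)

def walks_to_end(k, T):
    """W[t][v] = number of length-t walks in G_k from v to vertex k_1...k_{L-1}."""
    V = 1 << (len(k) - 1)
    y = int(k[:-1], 2)
    W = [[1 if v == y else 0 for v in range(V)]]
    for _ in range(T):
        prev = W[-1]
        W.append([sum(prev[u] for u in (step(v, 0, k), step(v, 1, k))
                      if u is not None) for v in range(V)])
    return W

def num_codewords(k, n):
    """c_{k,n}: walks of length n+L-1 from k_2...k_L to k_1...k_{L-1}; c_{k,0} = 1."""
    if n == 0:
        return 1
    T = n + len(k) - 1
    return walks_to_end(k, T)[T][int(k[1:], 2)]

def encode(m, k):
    """Sketch of Algorithm 5: index m >= 1 -> codeword string."""
    L, n, F_prev, F = len(k), 0, 0, 1
    while F < m:                                  # smallest n with F_{k,n} >= m
        n += 1
        F_prev, F = F, F + num_codewords(k, n)
    W = walks_to_end(k, n + L - 1)
    rho, v, bits = m - F_prev, int(k[1:], 2), []
    for i in range(1, n + 1):
        u0 = step(v, 0, k)
        dummy = 0 if u0 is None else W[n - i + L - 1][u0]   # |C_k(c_1..c_{i-1}0, n)|
        if rho <= dummy:
            bits.append("0")
            v = u0
        else:
            bits.append("1")
            rho -= dummy
            v = step(v, 1, k)
    return "".join(bits)

def decode(code, k):
    """Sketch of Algorithm 6: codeword string -> index m."""
    L, n = len(k), len(code)
    m = sum(num_codewords(k, j) for j in range(n)) + 1      # F_{k,n-1} + 1
    W = walks_to_end(k, n + L - 1)
    v = int(k[1:], 2)
    for i, bit in enumerate(code, start=1):
        if bit == "1":
            u0 = step(v, 0, k)
            if u0 is not None:
                m += W[n - i + L - 1][u0]
        v = step(v, int(bit), k)
    return m

print([encode(m, "010") for m in range(9, 16)])
```

For $k = 010$ the printed list should reproduce the assignment of Example 3. The sketch recomputes the walk table for every call, which keeps it short but is not how one would organize a performance-critical implementation.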


F. Exemplified Realization of the Encoding and Decoding 
Algorithms for General UW k 

The matrix expressions in (60) and (61) facilitate the presentation of Algorithms 5 and 6 for general UW; however, their implementation involves extensive computation of matrix multiplications. Since the entries in each row or column of $\mathbf{A}_k$ are all 0's except for at most two 1's, the complexity of computing

$$c_{k,n} = \mathbf{x}_k^{T}\,\mathbf{A}_k^{\,n+L-1}\,\mathbf{y}_k \tag{62}$$

and

$$|\mathcal{C}_k(d0,n)| = \mathbf{x}_k^{T}\,\mathbf{D}_{\ell(d)}\,\mathbf{A}_{k,0}\,\mathbf{A}_k^{\,n-\ell(d)+L-2}\,\mathbf{y}_k \tag{63}$$

is in fact relatively small. Furthermore, it is much easier to compute $c_{k,n}$ than $|\mathcal{C}_k(d0,n)|$. To see this, note from (6) that we have the following enumeration for $c_{k,n}$

x^adj(l-Afc0) 
det (l — Afcz) 

Thus, simply evaluating the RHS of the above equation gives the values of $c_{k,n}$ for $n = 1, 2, \ldots, L-1$. The remaining values of $c_{k,n}$ for $n \ge L$ can be easily determined through the recursion formula (18).

Another way to compute the values of $c_{k,n}$ can be easily obtained by modifying the algorithm for computing the values of $|\mathcal{C}_k(d0,n)|$, which we now discuss. The first step to compute $|\mathcal{C}_k(d0,n)|$ is to break up formula (63) into:

$$|\mathcal{C}_k(d0,n)| = \left(\mathbf{x}_k^{T}\mathbf{D}_{\ell(d)}\right) \cdot \mathbf{A}_{k,0}\,\mathbf{A}_k^{\,n-\ell(d)-1} \cdot \left(\mathbf{A}_k^{L-1}\mathbf{y}_k\right). \tag{64}$$



We note that from the choice of $d$ in the encoding algorithm, Algorithm 5, we must have $|\mathcal{C}_k(d,n)| \ge 1$.* Hence, $\mathbf{u} = \mathbf{D}_{\ell(d)}^{T}\mathbf{x}_k$ is actually a zero-one indication vector of length $2^{L-1}$ for the rightmost $(L-1)$ bits of the concatenation $kd$, i.e., all components of vector $\mathbf{u} = [u_1\ u_2\ \cdots\ u_{2^{L-1}}]^{T}$ are 0's except the $(j+1)$th component (being 1), where $j$ is the integer corresponding to the binary representation of the rightmost $(L-1)$ bits of $kd$. Hence, $\mathbf{u}$ can be directly determined without any computation. In addition, we can pre-compute $\mathbf{w}_k := \mathbf{A}_k^{L-1}\mathbf{y}_k$ since it is the same for all $n$ and $d$. Our task is therefore reduced to computing the value of

$$|\mathcal{C}_k(d0,n)| = \mathbf{u}^{T}\,\mathbf{A}_{k,0}\,\mathbf{A}_k^{\,n-\ell(d)-1}\,\mathbf{w}_k.$$

Below, we demonstrate how to utilize a finite state machine based on the digraph $G_k$ to evaluate $|\mathcal{C}_k(d0,n)|$ without resorting to matrix operations.

Notations that are used to describe the finite state machine are addressed first. Let $\mathbb{S} = \{s_{00\ldots0}, s_{00\ldots01}, \ldots, s_{11\ldots1}\}$ be the set of states indexed by all binary bit-streams of length $L-1$. We say $s_{\mathbf{i}} = s_{i_1 \ldots i_{L-1}}$ is a counting state if the $(i+1)$th component of $\mathbf{w}_k = \mathbf{A}_k^{L-1}\mathbf{y}_k$ is 1, where $i$ is the integer corresponding to

*Given any choice of prefix $d$, it is possible that $|\mathcal{C}_k(d,n)| = 0$ if $d$ is not a prefix of any codeword in $\mathcal{C}_k$, and in this case we have $\mathbf{u} = \mathbf{0}$ in (64). However, the prefix $d$ considered in our encoding algorithm, Algorithm 5, is always a prefix of some codeword; hence we have $|\mathcal{C}_k(d,n)| > 0$.


^ ^ C-k.n^ 
the binary representation $i_1 \ldots i_{L-1}$. Denote by $\mathbb{S}_k$ the set of all counting states corresponding to $k$. Also, for each state $s_i \in \mathbb{S}$ we define

$$\mathcal{I}(s_i) := \{s_j : (j,i) \in E_k\},$$

$$\mathcal{O}(s_i) := \{s_j : (i,j) \in E_k\},$$

$$\mathcal{I}_b(s_i) := \{s_j : (j,i) \in E_{k,b}\},$$

$$\mathcal{O}_b(s_i) := \{s_j : (i,j) \in E_{k,b}\},$$

for $b = 0, 1$, where the edge-sets $E_{k,0}$ and $E_{k,1}$ are defined in (58) and (59), respectively. Literally speaking, from digraph $G_k$, $\mathcal{I}(s_i)$ is the set of states that link directionally to $s_i$, $\mathcal{O}(s_i)$ is the set of states that are linked directionally by $s_i$, and $\mathcal{I}_0(s_i)$ is the set of states that link to $s_i$ via a so-called 0-edge in $E_{k,0}$. The sets $\mathcal{I}_1(s_i)$, $\mathcal{O}_0(s_i)$ and $\mathcal{O}_1(s_i)$ all have a similar meaning.

In our state machine, we associate each state $s_i$ with an integer. Without ambiguity, we use $s_i$ to also denote the integer associated with it. Define an operator $\Xi_k : \mathbb{S} \to \mathbb{S}$, which updates the value associated with each state according to:

$$\Xi_k : s_i \leftarrow \sum_{s_j \in \mathcal{I}(s_i)} s_j \quad \text{for all } s_i \in \mathbb{S}.$$

It should be noted that the operator $\Xi_k$ updates all states in $\mathbb{S}$ in a parallel fashion. Also, if $\mathcal{I}(s_i)$ is an empty set, operator $\Xi_k$ would set $s_i \leftarrow 0$. We similarly define operators $\Xi_{k,0}$ and $\Xi_{k,1}$ respectively as

$$\Xi_{k,0} : s_i \leftarrow \sum_{s_j \in \mathcal{I}_0(s_i)} s_j \quad \text{and} \quad \Xi_{k,1} : s_i \leftarrow \sum_{s_j \in \mathcal{I}_1(s_i)} s_j.$$

An example is provided below to help clarify these notations.

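Example 4: For UW $k = 000$ of length $L = 3$, there are four possible states in $\mathbb{S} = \{s_{00}, s_{01}, s_{10}, s_{11}\}$. Because $\mathbf{w}_k = \mathbf{A}_k^{2}\mathbf{y}_k = [0\ 1\ 0\ 1]^{T}$, we have $\mathbb{S}_k = \{s_{01}, s_{11}\}$. The 0-edges and 1-edges of the digraph are shown in Fig. 6. Table V then shows $\mathcal{I}(s_i)$, $\mathcal{O}(s_i)$, $\mathcal{I}_0(s_i)$, $\mathcal{I}_1(s_i)$, $\mathcal{O}_0(s_i)$ and $\mathcal{O}_1(s_i)$ for each $s_i$.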


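TABLE V

Various state sets for UW $k = 000$

$s_i$                | $s_{00}$       | $s_{01}$               | $s_{10}$               | $s_{11}$
$\mathcal{I}(s_i)$   | $\{s_{10}\}$   | $\{s_{00}, s_{10}\}$   | $\{s_{01}, s_{11}\}$   | $\{s_{01}, s_{11}\}$
$\mathcal{O}(s_i)$   | $\{s_{01}\}$   | $\{s_{10}, s_{11}\}$   | $\{s_{00}, s_{01}\}$   | $\{s_{10}, s_{11}\}$
$\mathcal{I}_0(s_i)$ | $\{s_{10}\}$   | $\{\}$                 | $\{s_{01}, s_{11}\}$   | $\{\}$
$\mathcal{I}_1(s_i)$ | $\{\}$         | $\{s_{00}, s_{10}\}$   | $\{\}$                 | $\{s_{01}, s_{11}\}$
$\mathcal{O}_0(s_i)$ | $\{\}$         | $\{s_{10}\}$           | $\{s_{00}\}$           | $\{s_{10}\}$
$\mathcal{O}_1(s_i)$ | $\{s_{01}\}$   | $\{s_{11}\}$           | $\{s_{01}\}$           | $\{s_{11}\}$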


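According to the first row in Table V, the operator $\Xi_k$ simultaneously updates all states in $\mathbb{S}$ according to

$$\Xi_k : \begin{cases} s_{00} \leftarrow s_{10} \\ s_{01} \leftarrow s_{00} + s_{10} \\ s_{10} \leftarrow s_{01} + s_{11} \\ s_{11} \leftarrow s_{01} + s_{11}. \end{cases}$$

Fig. 6. Digraph $G_{000}$ for UW $k = 000$

Likewise, the operators $\Xi_{k,0}$ and $\Xi_{k,1}$ simultaneously update all states in $\mathbb{S}$ according to

$$\Xi_{k,0} : \begin{cases} s_{00} \leftarrow s_{10} \\ s_{01} \leftarrow 0 \\ s_{10} \leftarrow s_{01} + s_{11} \\ s_{11} \leftarrow 0 \end{cases}$$

and

$$\Xi_{k,1} : \begin{cases} s_{00} \leftarrow 0 \\ s_{01} \leftarrow s_{00} + s_{10} \\ s_{10} \leftarrow 0 \\ s_{11} \leftarrow s_{01} + s_{11}. \end{cases}$$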


With the above, we now demonstrate how to compute $c_{k,n}$ and $|\mathcal{C}_k(d0,n)|$ using the finite state machine. Note $c_{k,n} = \mathbf{x}_k^{T}\mathbf{A}_k^{n}\mathbf{w}_k$ with $\mathbf{w}_k = \mathbf{A}_k^{L-1}\mathbf{y}_k$; hence to compute $c_{k,n}$, the states are initialized such that $s_{k_2 \ldots k_L} = 1$ and $s_i = 0$ for all remaining $i \ne k_2^{L}$. Note that these initial values correspond exactly to the component values of vector $\mathbf{x}_k$. Next we apply $n$ times the operator $\Xi_k$ to update the states in $\mathbb{S}$. It can be seen that the resulting values of the states correspond exactly to the contents of the row vector $\mathbf{x}_k^{T}\mathbf{A}_k^{n}$. Thus, the value of $c_{k,n}$ can be obtained by summing the values of the counting states. Again, we remark that we only need the finite state machine for computing the values of $c_{k,n}$ for $n = 1, 2, \ldots, L-1$, as the values of $c_{k,n}$ for $n \ge L$ can be easily determined by the recursion formula (18).

On the other hand, to compute

$$|\mathcal{C}_k(d0,n)| = \mathbf{u}^{T}\,\mathbf{A}_{k,0}\,\mathbf{A}_k^{\,n-\ell(d)-1}\,\mathbf{w}_k$$

for a given prefix $d$, we initialize the values associated with all states to be zero except $s_{u_{m-L+2} \cdots u_m} = 1$, where $u_{m-L+2} \cdots u_m$ are the rightmost $(L-1)$ elements of the concatenation $kd = u_1 \ldots u_m$. Apply the operator $\Xi_{k,0}$ to all states once, followed by updating all the states $(n - \ell(d) - 1)$ times via operator $\Xi_k$. Then, the sum of the values of all counting states equals $|\mathcal{C}_k(d0,n)|$.
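The state-machine procedure just described can be sketched in a few lines. The code below is an illustration under assumptions: states are named by their $(L-1)$-bit strings, the only edge absent from the digraph is the one from $k_1 \cdots k_{L-1}$ to $k_2 \cdots k_L$, a state is treated as a counting state when reading $k_1 \cdots k_{L-1}$ from it never uses that absent edge, and $L \ge 2$. The names `xi`, `initial_values`, `counting_states`, `num_codewords` and `prefix_count` are hypothetical.

```python
from itertools import product

def xi(values, k, bit=None):
    """Apply Xi_k (bit=None) or Xi_{k,b} (bit='0' or '1') to all states in parallel."""
    forbidden = (k[:-1], k[1:])            # the single edge removed from the digraph
    new = {s: 0 for s in values}
    for s, val in values.items():
        for b in "01":
            if bit is not None and b != bit:
                continue
            t = (s + b)[1:]                # state reached by appending bit b
            if (s, t) != forbidden:
                new[t] += val
    return new

def initial_values(k, state):
    """All states set to 0 except `state`, which is set to 1."""
    names = ["".join(p) for p in product("01", repeat=len(k) - 1)]
    return {s: int(s == state) for s in names}

def counting_states(k):
    """States from which reading k_1...k_{L-1} never uses the removed edge."""
    forbidden = (k[:-1], k[1:])
    result = set()
    for s in initial_values(k, ""):
        cur, ok = s, True
        for b in k[:-1]:
            nxt = (cur + b)[1:]
            ok = ok and (cur, nxt) != forbidden
            cur = nxt
        if ok:
            result.add(s)
    return result

def num_codewords(k, n):
    """c_{k,n} for n >= 1: start at s_{k_2...k_L}, apply Xi_k n times, sum counting states."""
    values = initial_values(k, k[1:])
    for _ in range(n):
        values = xi(values, k)
    return sum(values[s] for s in counting_states(k))

def prefix_count(k, d, n):
    """|C_k(d0, n)| for a prefix d of some codeword, with len(d) < n."""
    values = initial_values(k, (k + d)[-(len(k) - 1):])
    values = xi(values, k, bit="0")        # one application of Xi_{k,0}
    for _ in range(n - len(d) - 1):        # then n - l(d) - 1 applications of Xi_k
        values = xi(values, k)
    return sum(values[s] for s in counting_states(k))

print(sorted(counting_states("000")))
print([prefix_count("010", d, 4) for d in ("", "0", "01", "011")])
```

For $k = 000$ the counting states should come out as $s_{01}$ and $s_{11}$, as in Example 4, and for $k = 010$ the four printed prefix counts should match the values quoted in Example 3.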

V. Practice and performance of UDOOCs 


**Here we implicitly use the fact that $\mathbf{w}_k$ is a binary zero-one vector. Note that the $(i+1)$th component of $\mathbf{w}_k = [w_1\ w_2\ \cdots\ w_{2^{L-1}}]^{T}$ is equal to the number of distinct walks of length $(L-1)$ from vertex $i_1 \cdots i_{L-1}$ to vertex $k_1 \cdots k_{L-1}$ on digraph $G_k$. This fact follows since there is at most one walk of length $(L-1)$ between the above two vertices.


In Fig. 7, we compare the numbers of length-$n$ codewords for all UWs of lengths $L = 2$, 3, 4 and 5. These numbers are plotted in logarithmic scale and are normalized against the number of length-$n$ codewords for the all-zero UW $k = 0\ldots00$ to facilitate their comparison. By the equivalence relation defined in Definition 2, only one UW in each equivalence class needs to be illustrated. We have the following observations.

1) The logarithmic ratio $\log_2(c_{k,n}/c_{a,n})$, where $a = 0\ldots00$, exhibits some transient fluctuation for $n \le L$ but becomes a steady straight line of negative slope for $n > L$. This hints that $c_{k,n}$ has a steady exponential growth when $n$ is beyond $L$.

2) The number $c_{b,n}$, where $b = 00\ldots01$, is always the largest among all $c_{k,n}$ when $n$ is small. However, this number has an apparent trend to be overtaken by those of other UWs as $n$ grows and will eventually be smaller than the number of length-$n$ codewords for the all-zero UW. This result matches the statement of Theorem 2.

3) On the contrary, the number $c_{a,n}$ for the all-zero UW $a = 0\ldots00$ is the smallest among all $c_{k,n}$ for UWs of the same length when $n$ is small. Although Theorem 2 indicates that this number will eventually be the largest, Fig. 7 shows that this happens only when $n$ is very large.

4) As a result of the two previous observations, UW $b = 00\ldots01$ perhaps remains a better choice in the compression of sources with a practical number of source letters even though it is asymptotically the worst. We will confirm this inference by the later practice of UDOOCs on a real text source from the book Alice's Adventures in Wonderland.

We next investigate the compression rates of UDOOCs and compare them with those of the Huffman and Lempel-Ziv (specifically, LZ77 and LZ78) codes. In this experiment, the standard Huffman code in the communication toolbox of Matlab is used instead of the adaptive Huffman code. The LZ77 executable is obtained from the basic compression library in [34], while the LZ78 is self-implemented using the C++ programming language. As a convention, the data is binary ASCII encoded before it is fed into the two Lempel-Ziv compression algorithms. The sliding window for the LZ77 is set as 10,000 bits, and the tree-structured LZ78 is implemented without any windowing.

Three different English text sources are used, in which the uppercase and lowercase of each English letter are treated as the same symbol. The first English text source is distributed uniformly over the 26 symbols. The second English text source is assumed independent and identically distributed (i.i.d.) with marginal statistics from [36]. The third one is a realistic English text source from Alice's Adventures in Wonderland, in which any symbols other than the 26 English letters are regarded as a "space." In addition, the effect of grouping $t$ symbols as a grouped source for compression is studied, which will be termed $t$-grouper in remarks below. The results are summarized in Tables VI and VII, in which the average codeword length of UDOOCs has already taken into account the length of UWs. We remark on the experimental results as follows.

1) First of all, it can be observed from Table VI that the length-2 UW $k = 01$ gives a good per-letter average codeword length only when $t = 1$. When the size of source alphabet increases by grouping $t = 2$ or $t = 3$ letters as one symbol for UDOOC compression, the per-letter average codeword length dramatically grows. Note that $k = 01$ is the only UW whose number of length-$n$ codewords has a linear growth with respect to $n$, i.e., we have $c_{01,n} = n + 1$. Since the size of source alphabets increases exponentially in $t$ when $t$-grouper is employed, the resulting per-letter average codeword length also increases exponentially as $t$ grows. Therefore, when UW $= 01$, $t$-grouper will result in an extremely poor performance for moderately large $t$.

2) By independently generating 10® letters according to 
the statistics in [36] for compression, we record the 
per-letter average codeword length in the second row of Table VI. As expected, the Huffman coding scheme gives the smallest per-letter average codeword length of 4.253 bits per letter, when 3-grouper is used. The gap of per-letter average codeword lengths between the 3-grouper Huffman and the 3-grouper UDOOC however can be made as small as $4.795 - 4.253 = 0.542$ bits per source letter if UW $= 0001$. This is in contrast to the gap of 1.007 bits when the uniform independent English text source is the one to be compressed (cf. the first row in Table VI). We would like to point out that the error propagation of UDOOCs is limited firmly to at most two codewords, while that of the Huffman code may be statistically beyond this range. In comparison with the LZ77 and LZ78, the UDOOC clearly performs better in compression rate for the usual independent English text source.

3) When the compression of a source with memory such as the book titled Alice's Adventures in Wonderland [35] is concerned, the third row in Table VI shows that the gap of per-letter average codeword lengths between the optimal 3-grouper Huffman and the 3-grouper UDOOC with UW $= 0001$ is narrowed down to 0.305 bits per letter. The 3-grouper UDOOC with the all-zero UW also performs well for this source. Note that part of the per-letter average codeword length of UDOOCs is contributed by the UW, i.e., $L/t$; hence, in a sense, a larger $t$ and a smaller $L$ are favored (except for $L = 2$). As can be seen from Table VI, the best compression performance is given by $t = 3$, $L = 4$, and UW $= 0001$.

4) For the third English text source, the LZ77 performs better than all of the 1-grouper UDOOC compression schemes but one. We then compare the running time of both algorithms. We reduce the window size of LZ77 so that it has a similar running time to the 1-grouper UDOOC scheme. The compression performance of LZ77 degrades to 5.234 bits per letter, which is larger than that of the 1-grouper UDOOC. Note that we only compare their running time in encoding in Table VII as the decoding efficiency of UDOOCs is seemingly better than that of the LZ77. Considering also the low memory consumption of UDOOCs when a specific UW is pre-given in addition to its simplicity in implementation, the UDOOC can be regarded as a cost-effective compression scheme for practical applications.

Fig. 7. Normalized numbers of length-$n$ codewords for UWs of lengths $L = 2, 3, 4, 5$ (horizontal axis: codeword length $n$; surviving panel label: (b) $L = 3$)
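The rapid growth noted in remark 1) for UW $k = 01$ can be reproduced with a short calculation. The sketch below assumes a uniform source over $26^t$ grouped symbols, uses $c_{01,n} = n + 1$ so that the shortest available codewords are handed out first, and adds the $L = 2$ bits of the UW to every codeword before dividing by $t$. The function name and its parameters are hypothetical.

```python
def per_letter_length_uw01(t, alphabet=26):
    """Per-letter average codeword length (UW included) for UW k = 01 on a
    uniform source when t letters are grouped into one symbol."""
    symbols = alphabet ** t
    total, assigned, n = 0, 0, 0
    while assigned < symbols:
        take = min(n + 1, symbols - assigned)   # c_{01,n} = n + 1 codewords of length n
        total += take * n
        assigned += take
        n += 1
    return (total / symbols + 2) / t            # add L = 2 bits for the UW

for t in (1, 2, 3):
    print(t, per_letter_length_uw01(t))
```

Under these assumptions the three values agree, up to the last printed digit, with the $L = 2$, UW $= 00\cdots01$ entries for the uniform source in Table VI, which illustrates why a polynomially growing codebook cannot keep up with an exponentially growing alphabet.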


TABLE VII

Average codeword lengths in bits per source symbol and running time in seconds for the UDOOC encoding and the LZ77 encoding on Alice's Adventures in Wonderland. The programs are implemented using C++, and are executed on a Microsoft Windows-based desktop with an Intel Core i7 2.4 GHz CPU and 8 GB memory.

Type  |                          | Average codeword length | Running time
UDOOC | UW $k = 00$              | 4.887                   | 0.0162 sec
UDOOC | UW $k = 01$              | 4.068                   | 0.0158 sec
LZ77  | Window size = $10^4$ bits | 4.661                   | 0.0328 sec
LZ77  | Window size = 3000 bits  | 5.234                   | 0.01607 sec

TABLE VI

Average codeword lengths in bits per source symbol for the compression of three different sources. The best one among 1-grouper, 2-grouper and 3-grouper of the same compression scheme is marked with an asterisk. Triples are listed as $t = 1$ / $t = 2$ / $t = 3$.

Type | Entropy | LZ77 | LZ78 | Huffman | $L$ | UW $= 00\cdots0$ | UW $= 00\cdots01$
Independent English letter with uniform distribution | 4.700 / 4.700 / 4.700 | 7.992 | 7.178 | 4.768 / 4.738 / 4.702* | 2 | 6.961 / 6.820 / 6.790* | 5.846* / 12.76 / 41.99
 | | | | | 4 | 8.576 / 6.831 / 6.213* | 7.000 / 5.899 / 5.709*
 | | | | | 6 | 10.58 / 7.748 / 6.746* | 9.000 / 6.768 / 6.104*
Independent English letter with usual distribution | 4.246 / 4.246 / 4.246 | 7.925 | 6.626 | 4.274 / 4.261 / 4.253* | 2 | 5.591 / 5.550* / 5.637 | 4.557* / 7.771 / 20.907
 | | | | | 4 | 7.411 / 5.872 / 5.351* | 6.185 / 4.970 / 4.795*
 | | | | | 6 | 9.411 / 6.818 / 5.924* | 8.185 / 5.882 / 5.274*
Alice's Adventures in Wonderland | 3.914 / 3.570 / 3.215 | 4.661 | 6.028 | 3.940 / 3.585 / 3.226* | 2 | 4.887 / 4.340 / 3.958* | 4.068* / 4.975 / 7.573
 | | | | | 4 | 6.757 / 4.920 / 4.089* | 5.774 / 4.133 / 3.531*
 | | | | | 6 | 8.757 / 5.890 / 4.709* | 7.774 / 5.089 / 4.115*

VI. Conclusion

In this paper, we have provided a general construction of UDOOCs with arbitrary UW. Combinatorial properties of UDOOCs are subsequently investigated. Based on our studies, the appropriate UW for the UDOOC compression of a given source can be chosen. Various encoding and decoding algorithms for general UDOOCs, as well as their efficient counterparts for specific UWs like $k = 00\ldots0$, $00\ldots01$, are also provided. Performances of UDOOCs are then compared with the Huffman and Lempel-Ziv codes. Our experimental results show that the UDOOC can be a good practical candidate for lossless data compression when a cost-efficient solution is desired.

Appendix A
Proof of Theorem 1

In this section, we will prove (11), the enumeration of $s_{k,n}$ in Theorem 1. Our proof technique is similar to that in [23].

Let $\mathbb{F}^{\infty} := \bigcup_{n \ge 0} \mathbb{F}^{n}$ be the set of all binary sequences. For a word $w = w_1 \ldots w_n \in \mathbb{F}^{\infty}$ of length $n$, let $\mathcal{F}_k(w)$ be the set of index pairs indicating the places that $w$ contains $k$ as a subword, i.e.,

$$\mathcal{F}_k(w) := \{(i,j) : k = w_i^{j}\}.$$

Then

$$f(z) := \sum_{n \ge 0} s_{k,n} z^{n} \overset{(i)}{=} \sum_{w \in \mathbb{F}^{\infty}} z^{\ell(w)} \prod_{a \in \mathcal{F}_k(w)} \bigl(1 + (-1)\bigr) \overset{(ii)}{=} \sum_{w \in \mathbb{F}^{\infty}} \sum_{A \subseteq \mathcal{F}_k(w)} (-1)^{|A|} z^{\ell(w)}, \tag{65}$$

where in (i) we have adopted the convention of $0^0 = 1$, and (ii) follows from the inclusion-exclusion principle. In light of (65), we will regard the pair $(w,A)$ with $A \subseteq \mathcal{F}_k(w)$ as a marked word. The set of all marked words is thus defined as

$$\mathcal{M}_k := \{(w,A) : w \in \mathbb{F}^{\infty} \text{ and } A \subseteq \mathcal{F}_k(w)\}.$$

Define the following weight function for elements in $\mathcal{M}_k$

$$\pi(w,A) := (-1)^{|A|} z^{\ell(w)}; \tag{66}$$

then (65) can be rewritten as

$$f(z) = \sum_{(w,A) \in \mathcal{M}_k} \pi(w,A). \tag{67}$$

To determine $f(z)$, below we introduce the concept of a cluster.

Definition 6 (Cluster): We say the marked word $(w,A)$ is a cluster if, and only if,

$$\bigcup_{(i_t,j_t) \in A} [i_t, j_t] = [1, \ell(w)]$$

where by $[a,b]$ we mean the closed interval $\{x : a \le x \le b\}$ on the real line. The set of all clusters is thus

$$\mathcal{T}_k := \{(w,A) \in \mathcal{M}_k : (w,A) \text{ is a cluster}\}.$$

Definition 7 (Concatenation of sets of marked words): For any two sets of marked words $\mathcal{A}_k$ and $\mathcal{B}_k$, we define the concatenation of $\mathcal{A}_k$ and $\mathcal{B}_k$ as

$$\mathcal{A}_k \triangledown \mathcal{B}_k := \{(ab, A \cup Z(B,\ell(a))) : (a,A) \in \mathcal{A}_k,\ (b,B) \in \mathcal{B}_k\}$$

where by $ab$ we mean the usual concatenation of strings $a$ and $b$, and the function $Z(B,\ell(a))$ is

$$Z(B,\ell(a)) := \{(i_t + \ell(a), j_t + \ell(a)) : (i_t,j_t) \in B\}.$$

Having defined the concatenation operation $\triangledown$ for sets of marked words, we next claim the following decomposition for the set $\mathcal{M}_k$

$$\mathcal{M}_k = \{(\text{null}, \emptyset)\} \cup (\mathcal{M}_k \triangledown \mathcal{F}) \cup (\mathcal{M}_k \triangledown \mathcal{T}_k), \tag{68}$$

where $\mathcal{F} := \{(b, \emptyset) : b \in \mathbb{F}\}$.

To show (68), for any $(w,A) \in \mathcal{M}_k$ we distinguish the following three disjoint cases:

1) If $\ell(w) = 0$, it is obvious that $w$ is a null word and $A = \emptyset$ from the definition of $\mathcal{F}_k(w)$.

2) For $\ell(w) \ge 1$, appending an arbitrary bit $b \in \mathbb{F}$ to $w$ results in another marked word $(wb, A)$, which cannot be a cluster since

$$\bigcup_{(i_t,j_t) \in A} [i_t, j_t] \subsetneq [1, \ell(w) + 1].$$

Conversely, take any marked word $(w,A)$ from $\mathcal{M}_k$ with $\ell(w) = n$. If $j_t < \ell(w) = n$ for all $(i_t,j_t) \in A$, then we can delete the rightmost bit from $w$, and the resulting pair $(w_1^{n-1}, A)$ is still a marked word. Summarizing the above gives the following equalities between two sets of marked words

$$\{(w,A) \in \mathcal{M}_k : j_t < \ell(w) \text{ for all } (i_t,j_t) \in A\} = \{(wb, A) : (w,A) \in \mathcal{M}_k,\ b \in \mathbb{F}\} = \mathcal{M}_k \triangledown \mathcal{F}, \tag{69}$$

where the last equality follows from the definition of concatenation operation $\triangledown$.

3) The last case concerns the situation when $(w,A)$ satisfies $\ell(w) = n \ge 1$, $A = \{(i_1,j_1), \ldots, (i_m,j_m)\}$ and $i_1 < \cdots < i_m < j_m = n$. In other words, this is the case when $\max\{j_t : (i_t,j_t) \in A\} = \ell(w)$, which is disjoint from the second case. For this, let $u$ be the smallest index such that $[i_{u+t}, j_{u+t}] \cap [i_{u+t+1}, j_{u+t+1}] \ne \emptyset$ for all $t = 0, 1, \ldots, m-u-1$. Then obviously we have the following de-concatenation of $(w,A)$

$$(w,A) = \left(w_1^{i_u-1}, \{(i_t,j_t) : t = 1, \ldots, u-1\}\right) \triangledown \left(w_{i_u}^{n}, \{(i_t - i_u + 1, j_t - i_u + 1) : t = u, \ldots, m\}\right).$$

Clearly, the first marked word $(w_1^{i_u-1}, \{(i_t,j_t) : t = 1, \ldots, u-1\}) \in \mathcal{M}_k$. The second marked word $(w_{i_u}^{n}, \{(i_t - i_u + 1, j_t - i_u + 1) : t = u, \ldots, m\})$ is a cluster since

$$\bigcup_{t=u}^{m} [i_t - i_u + 1, j_t - i_u + 1] = [1, n - i_u + 1]$$

by the choice of $u$. Hence we arrive at the following equality between two sets of marked words

$$\{(w,A) \in \mathcal{M}_k : \max\{j_t : (i_t,j_t) \in A\} = \ell(w)\} = \mathcal{M}_k \triangledown \mathcal{T}_k. \tag{70}$$

Combining the case of null word and equations (69) and (70) proves the desired claim of (68).

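Using the decomposition in (68), we can rewrite (67) in terms of the three sets, i.e., the set for null word, $\mathcal{M}_k \triangledown \mathcal{F}$, and $\mathcal{M}_k \triangledown \mathcal{T}_k$. In particular, we have

$$\sum_{(w,W) \in \mathcal{M}_k \triangledown \mathcal{T}_k} \pi(w,W) = \sum_{(a,A) \in \mathcal{M}_k} \sum_{(b,B) \in \mathcal{T}_k} \pi\bigl(ab, A \cup Z(B,\ell(a))\bigr) = \sum_{(a,A) \in \mathcal{M}_k} \sum_{(b,B) \in \mathcal{T}_k} \pi(a,A)\,\pi(b,B) = \Biggl(\sum_{(a,A) \in \mathcal{M}_k} \pi(a,A)\Biggr)\Biggl(\sum_{(b,B) \in \mathcal{T}_k} \pi(b,B)\Biggr). \tag{71}$$

Similarly, one can show that

$$\sum_{(w,A) \in \mathcal{M}_k \triangledown \mathcal{F}} \pi(w,A) = 2z \sum_{(w,A) \in \mathcal{M}_k} \pi(w,A). \tag{72}$$

Substituting (71) and (72) into (67) gives

$$f(z) = \sum_{(w,A) \in \mathcal{M}_k} \pi(w,A) = 1 + 2z f(z) + f(z)\,T(z),$$

or equivalently,

$$f(z) = \frac{1}{1 - 2z - T(z)}, \tag{73}$$

where $T(z)$ is the weight enumerator of elements in $\mathcal{T}_k$ given by

$$T(z) := \sum_{(b,B) \in \mathcal{T}_k} \pi(b,B). \tag{74}$$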

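Determining $T(z)$ is now relatively easy. Recall that the overlap function $r_k(i) = \mathbb{1}\{k_1^{L-i} = k_{i+1}^{L}\}$, where $\mathbb{1}(\cdot)$ is the usual indicator function, shows exactly whether the length-$(L-i)$ prefix of $k$ is also a suffix of $k$. Let $\mathcal{R}_k = \{i : 1 \le i \le L-1,\ r_k(i) = 1\}$. For any cluster $(b,B) \in \mathcal{T}_k$ with $b = b_1 \ldots b_n$, we must have $b_{n-L+1}^{n} = k$ by Definition 6. So for any $i \in \mathcal{R}_k$, i.e., $r_k(i) = 1$, we have $b_{n-L+i+1}^{n} = k_{i+1}^{L} = k_1^{L-i}$. Hence the pair

$$\bigl(b\,k_{L-i+1} \ldots k_L,\ B \cup \{(n+i-L+1, n+i)\}\bigr)$$

is a cluster in $\mathcal{T}_k$. It implies that for $i \in \mathcal{R}_k$, the set

$$\mathcal{T}_{k,i} := \bigl\{\bigl(b\,k_{L-i+1}^{L},\ B \cup \{(n+i-L+1, n+i)\}\bigr) : (b,B) \in \mathcal{T}_k,\ n = \ell(b)\bigr\} \tag{75}$$

is a subset of $\mathcal{T}_k$.

On the other hand, take any $(b,B) \in \mathcal{T}_k$ with $\ell(b) = n$ and $B = \{(i_t,j_t) : t = 1, \ldots, m\}$, where $1 = i_1 < i_2 < \cdots < i_m < j_m = n$ and $i_m = n - L + 1$. If $m = 1$, then $b = k$ and $B = \{(1,L)\}$. Hence we consider the case when $m > 1$. As $(b,B)$ is a cluster, $[i_{m-1}, j_{m-1}] \cap [i_m, j_m] \ne \emptyset$ and $b_{i_{m-1}}^{j_{m-1}} = b_{i_m}^{j_m} = k$. Therefore, we must have $b_{i_m}^{j_{m-1}} = k_1^{v} = k_{L-v+1}^{L}$, where $v = j_{m-1} - i_m + 1$. Thus, $r_k(L-v) = 1$ and $(b,B) \in \mathcal{T}_{k,L-v}$. The above discussion then gives the following decomposition for $\mathcal{T}_k$

$$\mathcal{T}_k = \{(k, \{(1,L)\})\} \cup \Biggl(\bigcup_{i \in \mathcal{R}_k} \mathcal{T}_{k,i}\Biggr). \tag{76}$$

For enumerating the weights of elements in $\mathcal{T}_k$, we further claim that $\mathcal{T}_{k,i} \cap \mathcal{T}_{k,j} = \emptyset$ for all $i \ne j$. This simply follows from the definition of $\mathcal{T}_{k,i}$ in (75) that for any $(b,B) \in \mathcal{T}_{k,i}$ and $(b',B') \in \mathcal{T}_{k,j}$, say $B = \{(i_t,j_t) : t = 1, \ldots, m\}$ and $B' = \{(i'_t,j'_t) : t = 1, \ldots, m'\}$, where the pairs are arranged in ascending order, we have that $j_m - j_{m-1} = i$ for $B$ and $j'_{m'} - j'_{m'-1} = j$ for $B'$. This proves our claim. Finally, using (76) and the fact that the sets $\{\mathcal{T}_{k,i}\}$ are disjoint, we obtain

$$T(z) = \pi(k, \{(1,L)\}) + \sum_{i=1}^{L-1} r_k(i) \sum_{(b,B) \in \mathcal{T}_{k,i}} \pi(b,B) = -z^{L} - \sum_{i=1}^{L-1} r_k(i)\, z^{i} \sum_{(b,B) \in \mathcal{T}_k} \pi(b,B) = -z^{L} - \sum_{i=1}^{L-1} r_k(i)\, z^{i}\, T(z).$$

Hence

$$T(z) = \frac{-z^{L}}{1 + \sum_{i=1}^{L-1} r_k(i) z^{i}}.$$

Substituting the above into (73) proves (11) of Theorem 1.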
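The closed form obtained from (73) can be checked numerically. The sketch below assumes that $s_{k,n}$ counts the binary words of length $n$ that do not contain $k$ as a subword, and that substituting $T(z)$ into (73) gives $f(z) = N(z) / \bigl((1-2z)N(z) + z^{L}\bigr)$ with $N(z) = 1 + \sum_{i=1}^{L-1} r_k(i) z^{i}$. The power-series coefficients are extracted by the usual linear recurrence and compared with a brute-force count; the function names are hypothetical.

```python
from itertools import product

def overlap(k):
    """[r_k(1), ..., r_k(L-1)]: r_k(i) = 1 iff the length-(L-i) prefix of k equals its suffix."""
    L = len(k)
    return [1 if k[:L - i] == k[i:] else 0 for i in range(1, L)]

def series_coeffs(k, n_max):
    """Coefficients s_0..s_{n_max} of N(z) / ((1 - 2z) N(z) + z^L)."""
    L = len(k)
    num = [1] + overlap(k)                 # N(z), degree at most L-1
    den = [0] * (L + 1)
    for i, a in enumerate(num):            # (1 - 2z) * N(z)
        den[i] += a
        den[i + 1] -= 2 * a
    den[L] += 1                            # ... + z^L
    s = []
    for n in range(n_max + 1):
        val = num[n] if n < len(num) else 0
        for j in range(1, min(n, L) + 1):
            val -= den[j] * s[n - j]       # den[0] = 1, so no division is needed
        s.append(val)
    return s

def brute_force(k, n):
    """Number of length-n binary words that do not contain k."""
    return sum(1 for p in product("01", repeat=n) if k not in "".join(p))

for k in ("01", "00", "010", "0001"):
    print(k, series_coeffs(k, 8), [brute_force(k, n) for n in range(9)])
```

For $k = 01$ the series reduces to $1/(1-z)^2$, whose coefficients $n+1$ match the words of the form $1^{a}0^{b}$.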


Appendix B
Degree of $\det(\mathbf{I} - \mathbf{A}_k z)$

In this section, we will determine the degree of polynomial $\det(\mathbf{I} - \mathbf{A}_k z)$ that is required in the proof of Theorem 1.

Proposition 10: Let $\mathbf{A}_k$ be the adjacency matrix for the digraph $G_k$ associated with UW $k$ defined in Section III. Then

$$\deg \det(\mathbf{I} - \mathbf{A}_k z) = L. \tag{77}$$
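Before the proof, the statement can be probed numerically for small UWs. The sketch below relies on two assumptions: that $\mathbf{A}_k$ is the adjacency matrix of the de Bruijn-type digraph on $(L-1)$-bit vertices with the single edge from $k_1 \cdots k_{L-1}$ to $k_2 \cdots k_L$ removed, and the standard linear-algebra fact that $\deg\det(\mathbf{I} - \mathbf{A}z)$ equals the number of nonzero eigenvalues of $\mathbf{A}$ counted with algebraic multiplicity, which in turn equals $\mathrm{rank}(\mathbf{A}^{N})$ for an $N \times N$ matrix. Exact rational arithmetic is used so that the rank is not affected by rounding; all names are hypothetical.

```python
from fractions import Fraction

def adjacency(k):
    """A_k on (L-1)-bit vertices, with the edge that would complete k removed."""
    V = 1 << (len(k) - 1)
    A = [[0] * V for _ in range(V)]
    for v in range(V):
        for b in (0, 1):
            A[v][((v << 1) | b) & (V - 1)] = 1
    A[int(k[:-1], 2)][int(k[1:], 2)] = 0
    return A

def matmul(A, B):
    """Plain integer matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rank(M):
    """Rank by Gauss-Jordan elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
    return r

def degree_of_det(k):
    """deg det(I - A_k z), computed as rank(A_k^N) with N = 2^(L-1)."""
    A = adjacency(k)
    P = A
    for _ in range(len(A) - 1):
        P = matmul(P, A)
    return rank(P)

for k in ("01", "00", "010", "000", "0001"):
    print(k, degree_of_det(k))
```

If the proposition holds, each printed degree should equal the length of the corresponding UW.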


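Proof: First, from (9) and (11), the two equivalent formulas for the enumeration of $s_{k,n}$, we see $\det(\mathbf{I} - \mathbf{A}_k z)$ is divisible by $h_k(z) = (1 - 2z)\bigl(1 + \sum_{i=1}^{L-1} r_k(i) z^{i}\bigr) + z^{L}$. It follows that

$$\deg \det(\mathbf{I} - \mathbf{A}_k z) \ge L.$$

To establish the converse of the above inequality, i.e., $\deg \det(\mathbf{I} - \mathbf{A}_k z) \le L$, it suffices to show that $\mathrm{rank}(\mathbf{A}_k^{L-1}) \le L$, which in turn implies $\mathrm{rank}(\mathbf{A}_k^{n}) \le L$ for all $n \ge L-1$. As a result, the algebraic multiplicity of eigenvalue 0 for $\mathbf{A}_k$ is at least $2^{L-1} - L$. Hence, the degree of $\det(\mathbf{I} - \mathbf{A}_k z)$ is at most $L$.

To prove the claim, given the UW $k = k_1 \ldots k_L$ of length $L$ and the corresponding adjacency matrix $\mathbf{A}_k$ for digraph $G_k$, let

$$\mathbf{H} := \mathbf{A}_k + \mathbf{e}_{k_1^{L-1}}\,\mathbf{e}_{k_2^{L}}^{T}$$

where $k_1^{L-1} = k_1 \ldots k_{L-1}$ and $k_2^{L} = k_2 \ldots k_L$, and where by $\mathbf{e}_d$, a vector of length $2^{L-1}$ with $d = d_1 \ldots d_{L-1} \in \mathbb{F}^{L-1}$, we mean $(\mathbf{e}_d)_{j+1} = 1$ if $j$ has the binary representation $d$, and $(\mathbf{e}_d)_{j+1} = 0$, otherwise.

Apparently, $\mathbf{H}$ is the adjacency matrix for the digraph without UW forbidden constraint and is therefore independent of the choice of $k$. As an example, if $L = 3$, then

$$\mathbf{H} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}.$$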
1 
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Furthermore, it can be easily verified that ^ = 11^ is the Lemma 1: For any two length-n codewords a,b £ Ck{n), 
all-one matrix. Armed with the above, we now have we have and only if. 


L-2 

= - E (SfeiefcJ AL, (78) 

i^O 

where the last equality is due to the following identity for 
square matrices A and B: 


L-2 

B)\ 

i=0 

Applying the standard rank inequality of rank(A + B) < 
rank(A) -l-rank(i3) [19] to (78) yields 

rank(A^-i) 

L-2 

< rank rank Al) 

i^O 

L-2 

= i + Ei = L 

i=0 

and the proof is completed. ■ 


Appendix C 

Verification of Algorithms 5 and 6 

For completeness, we verify Algorithms 5 and 6 in this 
section. 

For message $u_1$, i.e., the most likely message, we have
from line 1 in Algorithm 5 that $m = 1$ and $n = 0$ since
$F_{k,0} = c_{k,0} = 1$. This results in the encoding output of the
null codeword. In parallel, when receiving the null codeword,
we have $n = 0$. Algorithm 6 then sets $m = 1$ at line 2 as
$F_{k,-1} = 0$. This verifies the correctness of Algorithms 5 and
6 for message $u_1$.

For $m \ge 2$, we shall show that for each $n \ge 1$, the encoding
function $\phi_k$ is a bijection between $U_k(n) = \{u_m : F_{k,n-1} <
m \le F_{k,n}\}$ and $C_k(n)$, and the decoding function $\psi_k$ is the
functional inverse of $\phi_k$. Equivalently, it suffices to show that

1) $\psi_k$ is a bijection between $C_k(n)$ and $U_k(n)$ for each
$n \ge 1$, and

2) $\phi_k$ is the functional inverse of $\psi_k$.

We will proceed with this approach.

Prior to establishing the claims, we first introduce below a 
well-ordering of binary sequences. This is in fact a key concept 
embedded in Algorithms 5 and 6. 

Definition 8 (Lexicographical ordering): For any two binary
sequences $a = a_1 \ldots a_i$ and $b = b_1 \ldots b_j$, we say $a \succ b$
if $i > j$, or if $i = j$ and there exists a smallest integer $s$,
$1 \le s \le i$, such that $a_u = b_u$ for $u = 1, \ldots, s-1$, $a_s = 1$,
and $b_s = 0$.

Obviously, such an ordering is a total ordering of binary
sequences. The key role that the lexicographical ordering of
binary sequences plays in the encoding and decoding of
UDOOCs is due to the following lemma.

Lemma 1: For any two length-$n$ codewords $a, b \in C_k(n)$,
we have $a \succ b$ if and only if

$$\sum_{i=1}^{n} a_i \, |C_k(a_1^{i-1}0, n)| > \sum_{i=1}^{n} b_i \, |C_k(b_1^{i-1}0, n)|. \tag{79}$$

Proof: As $\ell(a) = \ell(b)$ and $a \succ b$, there exists a smallest
integer $s$, $1 \le s \le n$, such that $a_u = b_u$ for $u = 1, \ldots, s-1$,
$a_s = 1$, and $b_s = 0$. Thus,

$$\begin{aligned}
\sum_{i=1}^{n} a_i \, |C_k(a_1^{i-1}0, n)|
&\ge \sum_{i=1}^{s-1} a_i \, |C_k(a_1^{i-1}0, n)| + |C_k(a_1^{s-1}0, n)| \\
&> \sum_{i=1}^{s-1} a_i \, |C_k(a_1^{i-1}0, n)| + \sum_{i=s+1}^{n} b_i \, |C_k(a_1^{s-1}0\,b_{s+1}^{i-1}0, n)| \\
&= \sum_{i=1}^{n} b_i \, |C_k(b_1^{i-1}0, n)|,
\end{aligned}$$

where the second inequality follows from the fact that the sets
$C_k(a_1^{s-1}0\,b_{s+1}^{i-1}0, n)$, where $i = s+1, \ldots, n$ and $b_i = 1$, are
disjoint proper subsets of $C_k(a_1^{s-1}0, n)$. ■

With the above lemma, given a codeword $c = c_1 \ldots c_n$,
Algorithm 6 outputs $\psi_k(c) = m$ with

$$\begin{aligned}
m &= \sum_{i=1}^{n} c_i \, \mathbf{x}_k^{\top} \Bigl( \prod_{j=1}^{i-1} A_{k,c_j} \Bigr) A_{k,0} A_k^{(n+L-1)-i} \mathbf{y}_k + F_{k,n-1} + 1 \\
&= \sum_{i=1}^{n} c_i \, |C_k(c_1^{i-1}0, n)| + F_{k,n-1} + 1.
\end{aligned} \tag{80}$$

We remark that the first term in the above, i.e.,
$\sum_{i=1}^{n} c_i \, |C_k(c_1^{i-1}0, n)|$, is the only term dependent on $c$, and it
also appears in (79). It means that the encoding and decoding
algorithms of UDOOC given in Algorithms 5 and 6 are indeed
based on the lexicographical ordering of length-$n$ codewords
in $C_k(n)$. Using Lemma 1 we can establish the range of $\psi_k$
when restricted to $C_k(n)$.

Corollary 4: The range of $\psi_k$ when restricted to $C_k(n)$ is
the set $U_k(n) = \{u_m : F_{k,n-1} < m \le F_{k,n}\}$. Therefore, $\psi_k$
is a bijection between $C_k(n)$ and $U_k(n)$ for all $n \ge 1$.

Proof: Given $C_k(n)$, let $b$ be the smallest member and
$d$ be the largest member according to the lexicographical
ordering, i.e., $b \preceq c \preceq d$ for all $c \in C_k(n)$. It then follows
from Lemma 1 that

$$\min_{c \in C_k(n)} \psi_k(c) = \psi_k(b) \quad \text{and} \quad \max_{c \in C_k(n)} \psi_k(c) = \psi_k(d).$$

For the minimum, from (80) we have

$$\psi_k(b) = \sum_{i=1}^{n} b_i \, |C_k(b_1^{i-1}0, n)| + F_{k,n-1} + 1.$$

Since $b$ is the smallest member, it follows that for all $i$, $i =
1, \ldots, n$, $|C_k(b_1^{i-1}0, n)| = 0$ if $b_i = 1$. Hence

$$\min_{c \in C_k(n)} \psi_k(c) = \psi_k(b) = F_{k,n-1} + 1.$$

To see the maximum, again from (80)

$$\psi_k(d) = \sum_{i=1}^{n} d_i \, |C_k(d_1^{i-1}0, n)| + F_{k,n-1} + 1.$$

Since $d$ is the largest member in $C_k(n)$, the sets $C_k(d_1^{i-1}0, n)$,
where $i = 1, \ldots, n$ and $d_i = 1$, are disjoint and proper subsets
of $C_k(n)$. Moreover, for any $c \in C_k(n)$ with $c \prec d$, there
exists a smallest integer $s$, $1 \le s \le n$, such that $d_u = c_u$ for
$u = 1, \ldots, s-1$, $d_s = 1$, and $c_s = 0$. This in turn implies
$c \in C_k(d_1^{s-1}0, n)$. Therefore,

$$\bigcup_{\substack{i = 1 \\ d_i = 1}}^{n} C_k(d_1^{i-1}0, n) = C_k(n) \setminus \{d\}$$

and

$$\psi_k(d) = c_{k,n} - 1 + F_{k,n-1} + 1 = F_{k,n}.$$

Finally, noting that $|C_k(n)| = |U_k(n)|$ and that $\psi_k$ is injective
by Lemma 1, we conclude that $\psi_k$ is bijective. ■
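Lemma 1 and Corollary 4 can be illustrated by brute force for short codewords. The sketch below is illustrative only and assumes that a length-$n$ word $c$ belongs to $C_k(n)$ when the UW occurs in the concatenation of UW, $c$, UW only as the leading and trailing copy, with the null word always counted as a codeword. It evaluates the second form of (80) by direct counting rather than by matrix products; the function names are hypothetical.

```python
from itertools import product

def codewords(uw, n):
    """Length-n codewords under the assumed membership rule, in lexicographic order."""
    if n == 0:
        return [""]  # the null codeword, c_{k,0} = 1
    L = len(uw)
    out = []
    for bits in product("01", repeat=n):
        c = "".join(bits)
        s = uw + c + uw
        # The UW may start only at position 0 or n + L, never strictly in between.
        if all(s[i:i + L] != uw for i in range(1, n + L)):
            out.append(c)
    return out

def psi(c, F_prev, words):
    """Index m of codeword c as in (80): sum of c_i * |C_k(c_1^{i-1} 0, n)| + F_{k,n-1} + 1."""
    total = 0
    for i, bit in enumerate(c):
        if bit == "1":
            prefix = c[:i] + "0"
            total += sum(w.startswith(prefix) for w in words)
    return total + F_prev + 1

def check_range(uw, max_n):
    """True if psi maps C_k(n), in increasing order, onto F_{k,n-1}+1, ..., F_{k,n}."""
    F_prev = 1  # F_{k,0} = 1
    for n in range(1, max_n + 1):
        words = codewords(uw, n)
        indices = [psi(c, F_prev, words) for c in sorted(words)]
        if indices != list(range(F_prev + 1, F_prev + len(words) + 1)):
            return False
        F_prev += len(words)
    return True
```

For equal-length bit strings, Python's string order coincides with the ordering of Definition 8, so `check_range` tests both the range claimed in Corollary 4 and the monotonicity stated in Lemma 1.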

So far we have established the first claim that $\psi_k$ is a
bijection between $C_k(n)$ and $U_k(n)$. To prove the second claim
that $\phi_k$ is the functional inverse of $\psi_k$, note that given a
codeword $c = c_1 \ldots c_n$, Algorithm 6 outputs

$$m = \psi_k(c) = \sum_{i=1}^{n} c_i \, |C_k(c_1^{i-1}0, n)| + F_{k,n-1} + 1$$

with $F_{k,n-1} < m \le F_{k,n}$. Line 2 of Algorithm 5 would therefore
produce the correct $n$ for $m$. Then, from line 3 of Algorithm
5, we get

$$\rho_0 = \sum_{i=1}^{n} c_i \, |C_k(c_1^{i-1}0, n)| + 1.$$

For the loop of lines 3-10 of Algorithm 5, when $i = 1$, dummy
has value

$$\text{dummy} = \mathbf{x}_k^{\top} A_{k,0} A_k^{(n+L-1)-1} \mathbf{y}_k = |C_k(0, n)|.$$

We distinguish two cases:

1) If $c_1 = 0$, then we must have

$$\rho_0 = \sum_{i=2}^{n} c_i \, |C_k(0c_2^{i-1}0, n)| + 1 \le \text{dummy}$$

since $\sum_{i=2}^{n} c_i \, |C_k(0c_2^{i-1}0, n)|$ is the sum of the cardinalities
of certain disjoint subsets (with different prefixes)
of $C_k(0, n)$, none of which contains $c$ itself. Hence lines 5-9
of Algorithm 5 output $c_1 = 0$ as desired.

2) If $c_1 = 1$, then

$$\rho_0 = |C_k(0, n)| + \sum_{i=2}^{n} c_i \, |C_k(1c_2^{i-1}0, n)| + 1 > \text{dummy}$$

and lines 5-9 of Algorithm 5 give the correct $c_1 = 1$.

Furthermore, it can be seen that at the end of line 9, we have

$$\rho_1 = \sum_{i=2}^{n} c_i \, |C_k(c_1^{i-1}0, n)| + 1$$

for the next iteration. Now suppose we are at the $t$-th iteration
of Algorithm 5 for some integer $t$ with $1 < t < n$. We have
already determined $c_1, c_2, \ldots, c_{t-1}$, and have

$$\rho_{t-1} = \sum_{i=t}^{n} c_i \, |C_k(c_1^{i-1}0, n)| + 1.$$

Line 4 of Algorithm 5 then gives

$$\text{dummy} = \mathbf{x}_k^{\top} \Bigl( \prod_{i=1}^{t-1} A_{k,c_i} \Bigr) A_{k,0} A_k^{(n+L-1)-t} \mathbf{y}_k = |C_k(c_1^{t-1}0, n)|.$$

Using the same reasoning as the above it can be easily shown
that lines 5-9 of Algorithm 5 always produce the correct value
for $c_t$. Finally at the $n$-th iteration we have

$$\rho_{n-1} = c_n \, |C_k(c_1^{n-1}0, n)| + 1$$

and

$$\text{dummy} = \mathbf{x}_k^{\top} \Bigl( \prod_{i=1}^{n-1} A_{k,c_i} \Bigr) A_{k,0} A_k^{L-1} \mathbf{y}_k = |C_k(c_1^{n-1}0, n)|.$$

It should be noted that $c_1^{n-1}0$ is a length-$n$ word, hence
dummy = 0 or 1. We distinguish the following cases:

1) If dummy = 0, then $c_1^{n-1}0$ cannot be a valid codeword
for UDOOC. Lines 5-9 of Algorithm 5 achieve exactly
the above, since we have

$$\rho_{n-1} = c_n \cdot \text{dummy} + 1 = 1 > \text{dummy} = 0$$

and the algorithm always outputs $c_n = 1$.

2) If dummy = 1, then $\rho_{n-1} = c_n + 1$. The same reasoning
as the above shows that lines 5-9 of Algorithm 5 always
produce the correct value for $c_n$.

We therefore complete the proof that $\phi_k$ is the functional
inverse of $\psi_k$.
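The case analysis above amounts to a bit-by-bit unranking procedure. The following sketch mirrors that reasoning under stated assumptions; it is not a transcription of Algorithms 5 and 6, which are not reproduced here. It assumes the same codeword membership rule as the brute-force enumeration (the UW occurs in UW, $c$, UW only as prefix and suffix), obtains every count $|C_k(\cdot, n)|$ by enumeration instead of matrix products, and uses hypothetical function names.

```python
from itertools import product

def codewords(uw, n):
    """Length-n codewords under the assumed membership rule (null word included)."""
    if n == 0:
        return [""]
    L = len(uw)
    words = ("".join(bits) for bits in product("01", repeat=n))
    return [c for c in words
            if all((uw + c + uw)[i:i + L] != uw for i in range(1, n + L))]

def encode(uw, m):
    """Map message index m >= 1 to a codeword by comparing rho with dummy at each bit."""
    F_prev, n, words = 0, 0, codewords(uw, 0)
    while F_prev + len(words) < m:          # find n with F_{k,n-1} < m <= F_{k,n}
        F_prev += len(words)
        n += 1
        words = codewords(uw, n)
    rho, prefix = m - F_prev, ""
    for _ in range(n):
        # dummy = |C_k(prefix 0, n)|: codewords that continue with a 0 here
        dummy = sum(w.startswith(prefix + "0") for w in words)
        if rho <= dummy:
            prefix += "0"
        else:
            prefix += "1"
            rho -= dummy
    return prefix

def decode(uw, c):
    """Inverse map: index of codeword c, following the counting form of (80)."""
    n = len(c)
    F_prev = sum(len(codewords(uw, j)) for j in range(n))
    words = codewords(uw, n)
    smaller = sum(
        sum(w.startswith(c[:i] + "0") for w in words)
        for i, bit in enumerate(c) if bit == "1"
    )
    return F_prev + smaller + 1
```

A round trip such as `decode("0001", encode("0001", m)) == m` for small `m` illustrates the claim that the encoding function is the functional inverse of the decoding function. The enumeration is exponential in $n$, so this is suitable only for small examples.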


Appendix D 

$\lim_{t \to \infty} L_{k,t}$ for All-Zero UW and Uniform I.I.D. Source


Let $a = 00 \ldots 0$ be the all-zero UW of length $L$. From (23),
(54), and (55), it can be easily verified that

$$\sum_{n=0}^{\infty} c_{a,n} z^n = 1 + \frac{z}{h_a(z)}. \tag{81}$$


Furthermore, from (23) we have $g(z) = (1 - z) h_a(z) =
1 - 2z + z^{L+1}$. It is straightforward to show that the two
polynomials $g(z)$ and $\frac{d}{dz} g(z)$ are co-prime to each other;
hence there are no repeated zeros in $h_a(z)$. This in turn implies
that all the nonzero eigenvalues of $A_a$ are simple.

Denote by $\lambda_1, \ldots, \lambda_L$ the nonzero eigenvalues of $A_a$, and
assume without loss of generality that $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge
|\lambda_L|$. Then

$$c_{a,n} = \delta_n + \sum_{i=1}^{L} a_i \lambda_i^n,$$

where $\delta_n$ equals 1 if $n = 0$ and 0 otherwise, and $a_1, \ldots, a_L$ are
constants such that (81) holds. We can
also obtain the closed-form expression for $F_{a,n}$ as

$$F_{a,n} = 1 + \sum_{i=1}^{L} a_i \frac{\lambda_i^{n+1} - 1}{\lambda_i - 1}, \quad \text{for all } n \ge 0.$$
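As a sanity check on the all-zero case, the polynomial identity for $g(z)$ and the resulting codeword counts can be computed directly. This sketch rests on two assumptions that are not restated in this appendix: that for the all-zero UW $h_a(z) = (1 - 2z)(1 + z + \cdots + z^{L-1}) + z^L$, and that the generating function of $c_{a,n}$ is $1 + z / h_a(z)$. The enumeration uses the membership rule that the UW occurs in UW, $c$, UW only as prefix and suffix. Function names are hypothetical.

```python
from itertools import product

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def h_all_zero(L):
    """Coefficients of (1 - 2z)(1 + z + ... + z^(L-1)) + z^L (assumed form of h_a)."""
    h = poly_mul([1, -2], [1] * L)
    h[L] += 1
    return h

def g_all_zero(L):
    """Coefficients of g(z) = (1 - z) h_a(z)."""
    return poly_mul([1, -1], h_all_zero(L))

def counts_from_recurrence(L, max_n):
    """c_{a,0..max_n} read off 1 + z / h_a(z), where h_a(z) = 1 - z - ... - z^L."""
    d = [0, 1]  # series coefficients of z / h_a(z)
    for n in range(2, max_n + 1):
        d.append(sum(d[max(0, n - L):n]))  # d_n = d_{n-1} + ... + d_{n-L}
    c = d[:max_n + 1]
    c[0] += 1  # the null codeword
    return c

def counts_by_enumeration(L, max_n):
    """Brute-force c_{a,n} for the all-zero UW under the assumed membership rule."""
    uw = "0" * L
    counts = [1]
    for n in range(1, max_n + 1):
        total = 0
        for bits in product("01", repeat=n):
            s = uw + "".join(bits) + uw
            if all(s[i:i + L] != uw for i in range(1, n + L)):
                total += 1
        counts.append(total)
    return counts
```

Comparing `counts_from_recurrence(L, n)` with `counts_by_enumeration(L, n)` for small `L` gives a quick consistency check, and the running sums of either list give $F_{a,n}$.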


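Consider a uniform i.i.d. source $U$ of alphabet size $M$ with
$M > 1$. Let $U^t$ be the grouped source obtained by grouping
any $t$ source symbols (with repetition) in $U$. It is clear that $U^t$
is also a uniform i.i.d. source. The per-letter average codeword
length is given by

$$L_{a,t} = \frac{1}{t} \Bigl( L + \sum_{i=1}^{M^t} p_i \, \ell(\phi_a(u_i)) \Bigr) = \frac{1}{t} \Bigl( L + \frac{1}{M^t} \sum_{i=1}^{M^t} \ell(\phi_a(u_i)) \Bigr).$$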


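Let $N$ be the smallest integer such that $F_{a,N} \ge M^t >
F_{a,N-1}$. Then

$$L_{a,t} \ge \frac{1}{t} \Bigl( L + \frac{1}{M^t} \sum_{i=1}^{N-1} i \cdot c_{a,i} \Bigr) \ge \frac{1}{\log_M(F_{a,N})} \Bigl( L + \frac{1}{F_{a,N}} \sum_{i=1}^{N-1} i \cdot c_{a,i} \Bigr).$$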


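Consequently,

$$\begin{aligned}
\lim_{t \to \infty} L_{a,t}
&\ge \lim_{N \to \infty} \frac{1}{\log_M(F_{a,N})} \Bigl( L + \frac{1}{F_{a,N}} \sum_{i=1}^{N-1} i \cdot c_{a,i} \Bigr) \\
&= \lim_{N \to \infty} \frac{\sum_{i=1}^{N-1} i \cdot c_{a,i}}{F_{a,N} \log_M(F_{a,N})} \\
&= \lim_{N \to \infty} \frac{\sum_{i=1}^{L} a_i \lambda_i \dfrac{(N-1)\lambda_i^{N} - N \lambda_i^{N-1} + 1}{(\lambda_i - 1)^2}}{\Bigl[ 1 + \sum_{i=1}^{L} a_i \dfrac{\lambda_i^{N+1} - 1}{\lambda_i - 1} \Bigr] \log_M \Bigl[ 1 + \sum_{i=1}^{L} a_i \dfrac{\lambda_i^{N+1} - 1}{\lambda_i - 1} \Bigr]} \\
&= \frac{1}{\log_M(\lambda_1)} = \frac{1}{\log_M(g_a)}.
\end{aligned}$$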

This implies

$$\lim_{t \to \infty} L_{a,t} \ge \frac{\log_2(M)}{\log_2(g_a)} = \frac{H(U)}{\log_2(g_a)}.$$
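To get a feel for the size of this bound, the growth rate can be approximated numerically. The sketch below assumes, as in the final equality of the limit above, that $g_a$ equals the largest eigenvalue $\lambda_1$, and that $\lambda_1$ is the unique real root in $(1, 2)$ of $x^L - x^{L-1} - \cdots - 1$ (equivalently, $1/\lambda_1$ is a zero of $h_a$) for $L \ge 2$. The function names and the bisection tolerance are illustrative choices.

```python
import math

def growth_rate(L, tol=1e-12):
    """Approximate the real root in (1, 2) of x^L - x^(L-1) - ... - 1 by bisection."""
    def f(x):
        return x ** L - sum(x ** j for j in range(L))
    lo, hi = 1.0, 2.0  # f(1) = 1 - L < 0 for L >= 2, and f(2) = 1 > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def per_letter_lower_bound(M, L):
    """log2(M) / log2(g_a) for a uniform source of alphabet size M (g_a assumed = lambda_1)."""
    return math.log2(M) / math.log2(growth_rate(L))
```

For instance, `growth_rate(2)` approximates the golden ratio, and since the root is always below 2, `per_letter_lower_bound(M, L)` exceeds $\log_2(M)$ for every finite $L$ while approaching it as $L$ grows.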